CN112116074A

CN112116074A - Image description method based on two-dimensional space coding

Info

Publication number: CN112116074A
Application number: CN202010985641.2A
Authority: CN
Inventors: 杨小宝; 武君胜; 屈佳欣; 冯菲蓉
Original assignee: Northwestern Polytechnical University; Xian University of Posts and Telecommunications
Current assignee: Northwestern Polytechnical University; Xian University of Posts and Telecommunications
Priority date: 2020-09-18
Filing date: 2020-09-18
Publication date: 2020-12-22
Anticipated expiration: 2040-09-18
Also published as: CN112116074B

Abstract

The invention relates to the technical field of image description, and discloses an image description method based on two-dimensional space coding, which comprises the following steps: s1, sending an image into an encoder model for image feature extraction to obtain a corresponding two-dimensional feature map; s2, encoding absolute position information for the two-dimensional characteristic diagram through sequential position encoding, coordinate position encoding or target-level position encoding; s3, converting the two-dimensional characteristic diagram into a one-dimensional sequence which can be identified by a decoder according to the absolute position information of the two-dimensional characteristic diagram.

Description

Image description method based on two-dimensional space coding

Technical Field

The invention relates to the technical field of image description, in particular to an image description method based on two-dimensional space coding.

Background

With the continuous development of artificial intelligence, computer vision becomes one of the most important research directions in the field of artificial intelligence, and since the training method for deep neural networks was published in science by Hinton professor in 2006, the vigorous development of deep learning is achieved, so that computer vision based on deep learning becomes the most active field of artificial intelligence at present. Vision technology not only requires that computers can "observe" things instead of human eyes, but also have the ability to "understand" things like the human brain, and its challenge is to develop computers and robots with visual abilities comparable to the human level, so that people can be helped to handle some complex technical applications. At present, computer vision based on deep learning is widely applied to various industries, including fields such as intelligent medical treatment, public security, unmanned aerial vehicle and automatic driving, and quality detection, crop identification and grading, quality detection, landmark tracking and the like of agricultural products all provide more and more convenience for human life. Image description (image capturing) is a comprehensive research direction combining computer vision, natural language processing and machine learning, which is similar to translating a picture into a descriptive text, and the task is very easy for human beings but very challenging for machines, and it not only needs to utilize a model to understand the content of the picture, but also needs to use natural language to express semantic relationship between the picture and the descriptive text, so that it is also the important point and difficulty of current artificial intelligence field research across disciplines. The image description means that given an image, not only what objects are on the image but also the interrelation between the objects is understood through a corresponding algorithm, and finally the description is described by using characters, which is similar to the 'talking in the picture' problem of pupils. With the rise of machine translation and big data, the research surge of Image Caption appears. Most of the current Image capture methods are based on an encoder-decoder model, wherein the encoder is generally a Convolutional Neural Network (CNN), the characteristics of the last fully-connected layer or convolutional layer are used as the characteristics of an Image, and the decoder is generally a Recurrent Neural Network (RNN) and is mainly used for generating Image description. At present, most of the famous teams at home and abroad can carry out deep research on the model of 'encoder-decoder' when improving the image description task. In order to better obtain high-level semantic information of the image, the original convolutional neural network is improved, and the extraction of the image characteristics at the encoder stage is enhanced; the original recurrent neural network is improved according to the inspiration of the machine translation field, so that the language expression capability of the decoder model is more accurate and richer.

In the field of computer vision, inputting the inherent geometric position structure between each target in an image is beneficial to reasoning visual information, and has a crucial influence on the related tasks of image understanding, for example, for two targets in an image, if the relative position relationship between the two targets is known, the understanding of a computer on the whole image can be further improved, so that richer image features are extracted. For image description, the relative position relation of pictures is not changed through the process of extracting features of the pictures by an encoder, so that the general encoder work does not specially add corresponding spatial position information to each pixel in the pictures. The image description is originally a research in a cross-field mode, a two-dimensional feature map is required to be converted into a one-dimensional feature sequence representation which can be identified by a decoder, but the feature map is not specially marked with spatial position information, and after the feature map is converted into the one-dimensional sequence, the spatial position relation of each original pixel point in the image is disturbed, so that the position information of the image is lost.

Disclosure of Invention

The invention provides an image description method based on two-dimensional space coding, which can solve the problem of image space position information loss.

The invention provides an image description method based on two-dimensional space coding, which comprises the following steps:

s1, sending an image into an encoder model for image feature extraction to obtain a corresponding two-dimensional feature map;

s2, encoding absolute position information for the two-dimensional characteristic diagram through sequential position encoding, coordinate position encoding or target-level position encoding;

and S3, converting the two-dimensional feature map into a one-dimensional sequence which can be recognized by a decoder according to the absolute position information of the two-dimensional feature map.

The sequential position coding and coordinate position coding in step S2 are used for the image-level image description and the attribute-level image description, and the target-level position coding is used for the target-level image description.

The image level image description and the attention level image description adopt an EfficientNet coder, and the target level image description adopts a Faster R-CNN coder.

The sequential position encoding in step S2 includes the steps of:

s21, in the process of converting the two-dimensional characteristic diagram into the one-dimensional sequence, the decoder model has the characteristics of reading pixel point information according to lines, the size of the input characteristic diagram is m x n, m x n pixel points in the two-dimensional characteristic diagram are sequentially coded according to the lines, and line position information, namely visual information V, of each pixel point of the two-dimensional characteristic diagram is obtained₀、V₁、V₂…V_(m*n)-2、V_(m*n)-1；

S22, according to the pixels in the two-dimensional feature map encoding the row position information, each row of pixels has n pixels from 0 to i-1, where i is 0, 1, …, and n-1, for the n pixels in the first row, when i is 0, the row extraction is started for the pixel with the row position information of 0 in the first row, and m times of extraction is performed according to i + j × n, j is 0, 1, …, and m-1, so as to obtain all the pixels in the 1 st row; when i is 1, performing row-column extraction on the pixels with row position information of 1 in the first row, then performing m-time extraction according to i + j × n to obtain all pixel points in the 2 nd column, and performing m-time extraction on all the pixel points from 0 to i-1 in the first row according to i + j × n by analogy, and finally obtaining the column position information of a characteristic diagram with the size of m × n;

and S23, obtaining absolute position information of the two-dimensional feature map according to the row position information and the column position information of each pixel point of the two-dimensional feature map.

The coordinate position encoding in step S2 includes the steps of:

s21, performing 0-i row position coding on the (i +1) × (i +1) feature graph output by the encoder, performing 0-i coding on i +1 pixel points in the first row, performing 0-i coding on i +1 pixel points in the second row, and so on, encoding 0-i on i +1 pixel points in each row to obtain row position information of the two-dimensional feature graph;

s22, transposing the two-dimensional feature map with the row position codes to obtain column position information of corresponding pixel points in the two-dimensional feature map;

The target level position encoding in step S2 includes the following steps:

s21, removing the target in the characteristic diagram through the anchor frame by the encoder model Faster R-CNN, calculating whether the anchor frame reaches the target or not and the category of the target, wherein the confidence degree represents the probability of the target;

s22, extracting a characteristic sequence of each target arranged according to the confidence degree through an encoder model Faster R-CNN;

s23, according to the feature sequences arranged according to the confidence degrees of each target, coding corresponding absolute position information for each target object by means of the position information of the internal center coordinates of the coder model Faster R-CNN, and obtaining the absolute position information of the feature map.

The specific process of encoding the corresponding absolute position information for each target object in step S23 includes the following steps:

s231, firstly, calculating the area of a rectangular frame according to the coordinate information of each target rectangular frame in the target characteristic diagram, and calculating the contact ratio between every two areas of a plurality of target rectangular frames by using an intersection ratio function IOU;

s232, calculating the positions of central points corresponding to the targets according to the position coordinates of the rectangular frame, and reversely mapping the positions of the central points to corresponding pixel points of a characteristic diagram, namely a convolution characteristic diagram, output by the encoder;

and S233, carrying out position coding on the corresponding pixel points to obtain the spatial position information of a plurality of targets, so that the spatial position information of each target is added to the one-dimensional visual sequence output by the coder and arranged with confidence.

The intersection-and-proportion function IOU in step S231 is a method of calculating the degree of coincidence between the rectangular frame area1 of the target 1 and the rectangular frame area2 of the target 2, and is defined as:

IOU＝area/(area1+area2-area)

the fact that the intersection ratio function IOU value is large means that the coincidence degree between the two targets is high, and if the intersection ratio function IOU value is small means that the two targets do not coincide.

Compared with the prior art, the invention has the beneficial effects that:

the invention converts the two-dimensional characteristic diagram into the one-dimensional sequence and then stores the position information of the original image, so that the one-dimensional sequence with the image position information can rely on word representation and comprehensive guidance of visual information after entering the decoder, and compared with the method that the one-dimensional sequence is not added with the image position information at the present stage, the image description effect is good.

Drawings

Fig. 1 is a flow chart illustrating an image of a one-dimensional sequence feature without spatial position information and with line priority according to an embodiment of the present invention.

FIG. 2 is a flowchart illustrating an image description of sequentially encoding row and column information according to an embodiment of the present invention.

FIG. 3 is a flowchart illustrating an image description of coordinate position encoding row and column information according to an embodiment of the present invention.

Fig. 4 is a flowchart of image description without adding target location information according to an embodiment of the present invention.

FIG. 5 is a flowchart illustrating a target-level image with spatial location information according to an embodiment of the present invention.

Fig. 6 is a schematic diagram of calculating the coincidence ratio of two rectangular boxes area1 (target a) and area2 (target B) according to the embodiment of the present invention.

Fig. 7 is a flowchart of an image description method based on two-dimensional space encoding according to an embodiment of the present invention.

Detailed Description

An embodiment of the present invention will be described in detail below with reference to fig. 1-7, but it should be understood that the scope of the present invention is not limited to the embodiment.

Feature map position encoding scheme for Image description (Image capture):

we use the "encoder-decoder" architecture most commonly used in the field of image description. The encoder uses the "EfficientNet" model or the "Faster R-CNN" model, and the decoder uses the basic parallel language generating model "based transformer". To explore a method for encoding position information for each pixel in a feature map, for convenience of description, we first discuss how position information is lost when a conventional image description task converts (reshape) a two-dimensional feature map into a one-dimensional sequence (as shown in fig. 1).

Firstly, an image with 512 × 512 size is sent to an encoder (EfficientNet) model for feature extraction of the image to obtain a corresponding feature map, at this time, the dimension of the feature map is [1,512,8,8], wherein the size of the image batch input to the encoder model is 1,512 refers to the number of channels of the convolution layer, two 8 refer to the size of the image length and width, and the image has 8 × 8 ═ 64 pixel points in total. Because the image has the characteristic of space position invariance after passing through the CNN encoder, the original image is converted into the feature map without losing the position information of each original pixel point, the two-dimensional feature map is processed (reshape) and then is changed into a one-dimensional sequence which can be identified by the decoder, the one-dimensional sequence generated by the reshape keeps batch size and channel number unchanged, only the pixel points of the two-dimensional image 8 x 8 are flattened into the one-dimensional sequence of 64 pixel points, and because the absolute position information of the feature map is not coded in the processing (reshape) process, the original position information of each pixel point of the feature map is lost after the two-dimensional feature map is converted into the one-dimensional sequence, so that some characteristics of the image can be lost, and the description quality of a decoder model is low.

The invention adds corresponding position coding information to the pixel point of each characteristic diagram, and the difference between the position information added by the one-dimensional sequence and the position information added by the one-dimensional sequence is that the two-dimensional position information of the image needs to code the Line position (Line Embedding) and the Column position (Column Embedding) of the image respectively. The feature map position coding method provided by the invention can exactly solve the problem that the position information of each pixel point is lost after the two-dimensional feature map is converted into the one-dimensional sequence, so that the converted one-dimensional sequence can also keep the original position information relationship of the image pixel points. Since image descriptions are divided into three types: the image level, the attribute level and the target level, and the image description of the target level is greatly different from the other two levels, and the image description has no sequential position relation information of the images and only has the position relation of the image target. Therefore, we propose three methods for feature map position coding: sequential position coding, coordinate position coding and target level position coding:

(1) sequential position coding: s21, in the process of converting the two-dimensional characteristic diagram into the one-dimensional sequence, the decoder model has the characteristics of reading pixel point information according to lines, the size of the input characteristic diagram is m x n, m x n pixel points in the two-dimensional characteristic diagram are sequentially coded according to the lines, and line position information, namely visual information V, of each pixel point of the two-dimensional characteristic diagram is obtained₀、V₁、V₂…V_(m*n)-2、V_(m*n)-1；

To illustrate this methodIn the method, m × n is 8 × 8,8 pixels are arranged in each row, extraction is performed for 8 times according to i + j × 8, where i is 0 to 7, and j is 0 to 7, and since there are 8 rows in total, extraction is sequentially performed for 8(0 to 7) pixels in the first row, when i is 0, extraction is performed for 8 times according to i +8 j, and j is 0 to 7 for 0 to 0 position in the first row, and when j is 0 for the first time, extraction is performed for i +8 × j being 0+8 + 0 to obtain P₀Position information, i +8 j-0 +8 j-1 is extracted to obtain P when j is 1 for the second time₀Position lower P₈The position information of (2) is extracted for 8 times by analogy in turn, and finally the position information P in the first column is obtained₀、P₈、P₁₆、P₂₄、P₃₂、P₄₀、P₄₈、P₅₆(ii) a Extracting the position information P of the second row from the position i to 1 according to the value of i +8 × j, j for 0-7 times₁、P₉、P₁₇、P₂₅、P₃₃、P₄₁、P₄₉、P₅₇And by analogy, extracting 8 times for each point of the first row 0-7 according to i +8 x j to finally obtain the column position information of the feature map.

Therefore, the row and column position information of each pixel point of the feature map can be obtained, and reshape conversion is carried out on the feature map with the two-dimensional space position information, so that the space position information of the image can be prevented from being lost.

(2) And (3) coordinate position coding: because each pixel point of the characteristic diagram needs to show the position information of rows and columns, we can think that the position of the pixel point is shown by a two-dimensional coordinate, the X axis of the coordinate shows the row position information of the pixel point, and the Y axis shows the column position information of the pixel point, so that corresponding space position information can be coded for each pixel point of the image, and the space position information of the original image is also stored after the two-dimensional characteristic diagram is converted into a one-dimensional sequence. As shown in fig. 3, firstly, the 8 × 8 feature map output by the encoder (EfficientNet) is subjected to line position encoding of 0 to 7, for example: coding 0-7 points of 8 pixels in the first row, and recording as (P)₀、P₁、P₂、P₃、P₄、P₅、P₆、P₇) And 8 pixel points in the second row are compiled from 0 to 7And coding 0-7 for 8 pixel points in each Line by analogy, thus completing Line position coding (Line Embedding) of the characteristic diagram. Next, transposing the feature map with row position codes to obtain Column position information (Column Embedding) of corresponding pixel points, such as converting into one-dimensional visual sequence (V) in fig. 3₀、V₁、V₂…V₆₂、V₆₃) Corresponding row position and column position information are added, so that the space position information of the original image can be stored after the characteristic diagram marked with the row and column position information is converted into a one-dimensional sequence.

(3) Position coding of the target level: as described above, there are three types of image description tasks: an image level, an attention level, and a target level. The first two position information coding modes are suitable for image level and attribute level image description and are not suitable for target level image description tasks, because a target level image description encoder uses a classic model for target detection, the sequence feature sequence of the whole image is not extracted through the encoder, but the feature sequence of a significant target region in the image, and the EfficientNet encoder used in the first two methods cannot extract the features of the target region, so the encoder needs to be changed into a fast R-CNN model before adding position information to the target level image description tasks. The image description of the target level is extracted by an encoder Faster R-CNN as a characteristic sequence of each target arranged according to confidence coefficient: the model generates an anchor frame, then frames the target in the image, and the model calculates whether the anchor frame frames the target and which class the framed target belongs to, and the confidence level is to indicate how likely the framed target belongs to this class, as shown in fig. 4: the confidence coefficient of water is 0.99, the confidence coefficient of ducks is 0.92, and the confidence coefficient of snow mountains is 0.80, so that the characteristic diagrams obtained 7 x 7 are water (C), ducks (B) and snow mountains (A) in sequence, and similarly, the one-dimensional visual characteristic sequence obtained by each target does not have corresponding spatial position information. Because each target rectangular frame (rectangle box) in the fast R-CNN model has coordinate information of four frame corners and length and width information (x, y, w, h) of the frame, and is used to obtain a rectangular frame with higher accuracy (such as the target feature map in fig. 5) in the process of generating the rpn (regional pro-potential network), we can encode corresponding absolute position information for each target object by means of the center coordinate position information in the target-level encoder model, so that the fast R-CNN model outputs a one-dimensional visual sequence arranged according to confidence level and adds corresponding spatial position information. As shown in fig. 5, we do this: first, the area of the rectangular frame is calculated according to the coordinate information of each target rectangular frame in the target feature map, and the coincidence degree between every two rectangular frame areas of three targets (A, B, C) is calculated by using an IOU method, wherein the IOU is a method for calculating the coincidence degree of two rectangular frames area1 (target A) and area2 (target B), and is defined as shown in FIG. 6:

IOU＝area/(area1+area2-area)

the IOU value is very large, so that the coincidence degree between the two targets is very high, if the IOU value is very small, so that the IOU of the target A snow mountain and the target B duck is very small, the IOU of the target B duck and the target C water is very large, and the IOU of the target A snow mountain and the target C water is very small, so that the position relation between the three targets is obtained: the snow mountain is independent and has no overlapping relation with the ducks and the water, and the high IOU value of the ducks and the water indicates that the ducks are contained in the water; secondly, calculating the corresponding central point positions of the three targets according to the position coordinates of the rectangular frame, reversely mapping the central point positions to corresponding pixel points of a convolution feature map, as shown by ABC mark points of the three targets on the convolution feature map in FIG. 5, wherein the central point position of the snow mountain of the point A is on the 11 th pixel point, the central point position of the duck of the point B is on the 27 th pixel point, the central point position of the water of the point C is on the 43 th pixel point, and simultaneously carrying out position coding on the corresponding pixel points to obtain the spatial position information of the three targets, so that the spatial position information of each target is added to a one-dimensional visual sequence output by an encoder and arranged with confidence.

The invention aims to solve the problem of image pixel point position information loss in an image description algorithm, and provides three image position coding methods which are all used for adding absolute position information to an output characteristic diagram of an encoder.

For an image description task, the design and selection of an encoder and a decoder are crucial to the overall description effect of the task, but existing models ignore lost position information when a two-dimensional characteristic image is converted into a one-dimensional sequence, the one-dimensional sequence without the image position information can only depend on a word representation method after entering the decoder, and the guidance of visual information is lacked, so three image position information coding methods are provided, image descriptions at an image level and an attribute level can optionally adopt a sequential position coding and coordinate position coding method, and an image description task at a target level adopts a target position coding scheme.

The image description method of sequential position coding or coordinate position coding focuses on the whole information of an image in the whole process, if a local individual region is shielded, the characteristics of detection, matching and the like of other features cannot be influenced due to the disappearance of the local features, the whole description effect of image description is not greatly influenced, although the text semantics of description is probably not rich, the semantic integrity of image description is good.

According to the image description method of the target position coding, the whole information of an image is paid attention to in the whole process, and then the characteristic extraction is carried out on the salient target area of the image. Because the description effect of each target information in the image on the decoding part is greatly influenced, the text generated by the image description task of the target position coding method not only has good semantic integrity but also has higher semantic richness. The overall description effect is better than the image description method of the sequential position coding and the attribute position coding.

No matter which position coding method is adopted, the two-dimensional characteristic diagram can be converted into a one-dimensional sequence, and the position information of the original image is also stored, so that the one-dimensional sequence with the image position information can be guided comprehensively by a word representation method and visual information after entering a decoder, and the description effect is better than that of the method that the image position information is not added to the one-dimensional sequence at the present stage.

The above disclosure is only for a few specific embodiments of the present invention, however, the present invention is not limited to the above embodiments, and any variations that can be made by those skilled in the art are intended to fall within the scope of the present invention.

Claims

1. An image description method based on two-dimensional space coding is characterized by comprising the following steps:

2. The image description method based on two-dimensional space coding according to claim 1, wherein the sequential position coding and coordinate position coding in step S2 are used for image-level image description and attribute-level image description, and the target-level position coding is used for target-level image description.

3. The method as claimed in claim 2, wherein the image description and the attribute-level image description are encoded by using an EfficientNet encoder, and the target-level image description is encoded by using a Faster R-CNN encoder.

4. The image description method based on two-dimensional space coding according to claim 1, wherein said sequential position coding in step S2 comprises the steps of:

5. The image description method based on two-dimensional space encoding according to claim 1, wherein said coordinate position encoding in step S2 includes the steps of:

6. The image description method based on two-dimensional space encoding according to claim 2, wherein said target-level position encoding in step S2 includes the steps of:

7. The image description method based on two-dimensional space encoding as claimed in claim 6, wherein the specific process of encoding the corresponding absolute position information for each target object in step S23 includes the following steps:

8. The image description method based on two-dimensional space encoding of claim 7, wherein the cross-over ratio function IOU in step S231 is a method for calculating the coincidence ratio between the rectangular frame area1 of the object 1 and the rectangular frame area2 of the object 2, and is defined as:

IOU＝area/(area1+area2-area)