CN115713621A - Cross-modal image target detection method and device by using text information - Google Patents

Cross-modal image target detection method and device by using text information

Info

Publication number
CN115713621A
Authority
CN
China
Prior art keywords
matrix, neural network, network module, cross, target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211445740.7A
Other languages
Chinese (zh)
Inventor
孔欧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Mdata Information Technology Co ltd
Original Assignee
Shanghai Mdata Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Mdata Information Technology Co ltd filed Critical Shanghai Mdata Information Technology Co ltd
Priority to CN202211445740.7A priority Critical patent/CN115713621A/en
Publication of CN115713621A publication Critical patent/CN115713621A/en
Pending legal-status Critical Current

Abstract

The invention relates to a method and a device for cross-modal image target detection using text information, wherein the method comprises the following steps: acquiring image data and label information of a target to be identified; detecting all targets in the image data with a candidate-box neural network module and cropping them to obtain a plurality of regions of interest; extracting image features of the regions of interest with an image feature extraction neural network module to obtain first features; extracting text features of the label information of the target to be identified with a text feature extraction neural network module to obtain second features; merging the first features and the second features to obtain a merged matrix; interactively fusing the features in the merged matrix with a cross-modal feature fusion neural network module and separating the fused matrix into two separation matrices; and calculating a similarity matrix from the two separation matrices and determining the target detection category from the similarity matrix. The invention makes up for the lack of open-vocabulary detection capability.

Description

Cross-modal image target detection method and device by using text information
Technical Field
The invention relates to the technical field of target detection, and in particular to a method and a device for cross-modal image target detection using text information.
Background
General target detection methods have two limitations: 1. they use only the information of the image modality and cannot effectively use text information to enrich the semantic information of the image; 2. if a model is trained on only 10 classes, the targets it can detect at inference time are limited to those 10 classes.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a method and a device for cross-modal image target detection using text information, making up for the lack of open-vocabulary detection capability.
The technical scheme adopted by the invention for solving the technical problem is as follows: the method for detecting the cross-modal image target by using the text information comprises the following steps:
acquiring image data and label information of a target to be identified;
detecting all targets in the image data with a candidate-box neural network module, determining the positions of all targets, and cropping all targets from the image data based on these positions to obtain a plurality of regions of interest;
extracting image features of the regions of interest with an image feature extraction neural network module to obtain first features;
extracting text features of the label information of the target to be identified with a text feature extraction neural network module to obtain second features;
merging the first features and the second features to obtain a merged matrix;
interactively fusing the first features and the second features in the merged matrix with a cross-modal feature fusion neural network module to obtain a fused matrix, and separating the fused matrix into two separation matrices;
and calculating a similarity matrix from the two separation matrices and determining the target detection category from the similarity matrix.
The candidate box neural network module is a DETR target detection network.
The image feature extraction neural network module is a ViT pre-trained model.
The text feature extraction neural network module is a BERT network.
The cross-modal feature fusion neural network module comprises three self-attention layers and two fully connected layers arranged in sequence. The three self-attention layers are used for interactively fusing the first features and the second features in the merged matrix and outputting a fused matrix; the two fully connected layers are used for separating the fused matrix into two separation matrices.
The technical scheme adopted by the invention for solving the technical problems is as follows: provided is a cross-modal image object detection apparatus using text information, including:
the acquisition module is used for acquiring image data and label information of a target to be identified;
the candidate-box neural network module is used for detecting all targets in the image data, determining the positions of all targets, and cropping all targets from the image data based on these positions to obtain a plurality of regions of interest;
the image feature extraction neural network module is used for extracting image features of the region of interest to obtain first features;
the text feature extraction neural network module is used for extracting text features of the label information of the target to be identified to obtain second features;
the merging module is used for merging the first features and the second features to obtain a merged matrix;
the cross-modal feature fusion neural network module is used for interactively fusing the first features and the second features in the merged matrix to obtain a fused matrix and separating the fused matrix into two separation matrices;
and the classification module is used for calculating a similarity matrix from the two separation matrices and determining the target detection category from the similarity matrix.
The cross-modal feature fusion neural network module comprises three self-attention layers and two fully connected layers arranged in sequence. The three self-attention layers are used for interactively fusing the first features and the second features in the merged matrix and outputting a fused matrix; the two fully connected layers are used for separating the fused matrix into two separation matrices.
The classification module comprises: a calculation unit, used for performing matrix multiplication on the two separation matrices and applying a Softmax normalization to the result to obtain a similarity matrix; and a comparison unit, used for comparing the label similarities in the similarity matrix with a threshold, removing labels below the threshold, and taking the retained labels as the target detection categories.
The technical scheme adopted by the invention for solving the technical problems is as follows: there is provided an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the cross-modal image object detection method using text information when executing the computer program.
The technical scheme adopted by the invention for solving the technical problem is as follows: there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the above-described cross-modality image object detection method using text information.
Advantageous effects
Due to the adoption of the above technical scheme, compared with the prior art, the invention has the following advantages and positive effects: through the combined action of the candidate-box neural network module, the image feature extraction neural network module, the text feature extraction neural network module, and the cross-modal feature fusion neural network module, the invention enriches the semantic information of the image features, can effectively detect targets of arbitrary categories without additional category training, and makes up for the lack of open-vocabulary detection capability.
Drawings
FIG. 1 is a flow chart of a first embodiment of the present invention;
FIG. 2 is a block diagram showing the structure of a second embodiment of the present invention.
Detailed Description
The invention will be further illustrated with reference to the following specific examples. It should be understood that these examples are for illustration only and are not intended to limit the scope of the invention. Furthermore, it should be understood that, after reading the teaching of the invention, those skilled in the art may make various changes or modifications to the invention, and such equivalents likewise fall within the scope defined by the appended claims.
A first embodiment of the present invention relates to a cross-modal image target detection method using text information which, as shown in FIG. 1, comprises the following steps:
step 1, acquiring image data and label information of a target to be identified;
and 2, detecting all targets in the image data by adopting a candidate frame neural network module, determining the positions of all the targets, and intercepting all the targets from the image data based on the positions to obtain a plurality of interested areas.
In this step, the candidate-box neural network module uses a DETR target detection network, which can locate M targets in the image data and output M pieces of coordinate information, each being a rectangular box given by the upper-left (x, y) and lower-right (x, y) coordinates of a target. The target areas are cropped from the original input picture using the M pieces of coordinate information, yielding the regions of interest.
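As a minimal sketch of this cropping step (the detector itself is abstracted away behind the hypothetical `boxes` input, and the boxes are assumed to be absolute pixel coordinates in the (x1, y1, x2, y2) order described above):

```python
# Crop one region of interest per box. `boxes` is assumed to hold the
# (x1, y1, x2, y2) rectangles output by the candidate-box network.
from PIL import Image

def crop_regions_of_interest(image: Image.Image, boxes):
    rois = []
    for x1, y1, x2, y2 in boxes:
        rois.append(image.crop((int(x1), int(y1), int(x2), int(y2))))
    return rois

# Example with two hypothetical boxes:
# image = Image.open("input.jpg")
# rois = crop_regions_of_interest(image, [(10, 20, 110, 220), (150, 40, 300, 260)])
```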
Step 3: extract image features of the regions of interest with the image feature extraction neural network module to obtain the first features.
In this step, the image feature extraction neural network module uses a ViT pre-trained model, which extracts a feature from each region of interest, so M first features are obtained. Each first feature is a 768-dimensional vector, and together the M first features form a matrix named A.
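A sketch of this image branch follows. The patent names a ViT pre-trained model but no specific checkpoint, so the Hugging Face checkpoint "google/vit-base-patch16-224-in21k" and the use of each region's [CLS] token as its feature are assumptions here:

```python
# Extract one 768-dim feature per ROI with a pre-trained ViT.
# Checkpoint choice and [CLS] pooling are assumptions, not from the patent.
import torch
from transformers import ViTImageProcessor, ViTModel

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

def extract_first_features(rois):
    """Return matrix A of shape (M, 768), one row per region of interest."""
    inputs = processor(images=rois, return_tensors="pt")
    with torch.no_grad():
        outputs = vit(**inputs)
    return outputs.last_hidden_state[:, 0]  # [CLS] embedding of each ROI
```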
Step 4: extract text features of the label information of the target to be identified with the text feature extraction neural network module to obtain the second features.
In this step, the text feature extraction neural network module adopts a BERT network. Since this module extracts text features, the label information of the target to be identified must first be combined into a sentence. For example, if the label categories of the target to be identified are bird, duck, and automobile, these categories are joined into a sentence, tokenized, and input into the text feature extraction neural network module, which extracts the features of the input text. The resulting N second features are 768-dimensional vectors and together form a matrix named B.
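A sketch of this text branch follows. The patent joins the labels into one sentence before tokenization; encoding each label string separately and taking its [CLS] vector is a simplification adopted here, and the "bert-base-chinese" checkpoint is an assumption:

```python
# Extract one 768-dim feature per label with BERT.
# Per-label encoding and the checkpoint are assumptions, not from the patent.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

def extract_second_features(labels):
    """Return matrix B of shape (N, 768), one row per label string."""
    inputs = tokenizer(labels, padding=True, return_tensors="pt")
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state[:, 0]  # [CLS] embedding of each label

# B = extract_second_features(["bird", "duck", "automobile"])
```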
Step 5: merge the first features and the second features to obtain a merged matrix; that is, concatenating A, of shape (M, 768), with B, of shape (N, 768), row-wise yields a merged matrix of shape (M + N, 768).
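In code, this merge is a single row-wise concatenation:

```python
# Concatenate A (M, 768) and B (N, 768) into the (M + N, 768) merged matrix.
import torch

def merge_features(A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    return torch.cat([A, B], dim=0)
```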
Step 6: interactively fuse the first features and the second features in the merged matrix with the cross-modal feature fusion neural network module to obtain a fused matrix, and separate the fused matrix into two separation matrices.
In this step, the cross-modal feature fusion neural network module comprises three self-attention layers and two fully connected layers arranged in sequence. The three self-attention layers interactively fuse the first features and the second features in the merged matrix and output a fused matrix; that is, the image features and the text features interact through the self-attention mechanism, and the fused matrix, of shape (M + N, 768), is named C. The two fully connected layers separate the fused matrix into two separation matrices: C is passed through the fully connected layers to obtain an (M + N, 512) matrix, which is split into an (M, 512) separation matrix and an (N, 512) separation matrix, named D and E respectively.
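A minimal PyTorch sketch of this module follows. The patent fixes only the layer counts and the 768/512 widths; the head count, the hidden width of the fully connected stack, and the use of plain nn.MultiheadAttention layers are assumptions:

```python
# Three self-attention layers over the merged (M + N, 768) matrix, two fully
# connected layers mapping 768 -> 512, then a row-wise split into D and E.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=768, out_dim=512, heads=8):  # head count is an assumption
        super().__init__()
        self.attn_layers = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(3)
        )
        self.fc = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, out_dim))

    def forward(self, merged: torch.Tensor, num_regions: int):
        x = merged.unsqueeze(0)                  # (1, M + N, 768)
        for attn in self.attn_layers:            # image and text features interact
            x, _ = attn(x, x, x)
        x = self.fc(x).squeeze(0)                # (M + N, 512)
        return x[:num_regions], x[num_regions:]  # D: (M, 512), E: (N, 512)
```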
Step 7: calculate the similarity matrix of the two separation matrices and determine the target detection category from the similarity matrix. Specifically, the separation matrix D is matrix-multiplied with the separation matrix E, and a Softmax normalization is then applied to obtain a similarity matrix of shape (M, N), which represents the similarities between the M image regions and the N labels. Assuming a threshold of 0.5, category labels whose similarity is below 0.5 are filtered out, labels with similarity greater than or equal to 0.5 are retained, and the retained labels are taken as the final target detection categories.
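A sketch of this scoring step; applying the Softmax row-wise over the N labels is an assumption consistent with the statement that the matrix holds the similarities between M image regions and N labels:

```python
# Score every (region, label) pair, then keep labels at or above the threshold.
import torch

def classify(D: torch.Tensor, E: torch.Tensor, labels, threshold=0.5):
    sim = torch.softmax(D @ E.T, dim=-1)  # (M, N) similarity matrix
    results = []
    for row in sim:                        # one row per image region
        kept = [(labels[j], row[j].item())
                for j in range(len(labels)) if row[j] >= threshold]
        results.append(kept)               # retained labels = detected categories
    return results
```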
Through the combined action of the candidate-box neural network module, the image feature extraction neural network module, the text feature extraction neural network module, and the cross-modal feature fusion neural network module, the invention enriches the semantic information of the image features, can effectively detect targets of arbitrary categories without additional category training, and makes up for the lack of open-vocabulary detection capability.
A second embodiment of the present invention relates to a cross-modal image object detection apparatus using text information, as shown in fig. 2, including:
the acquisition module is used for acquiring image data and label information of a target to be identified;
the candidate-box neural network module is used for detecting all targets in the image data, determining the positions of all targets, and cropping all targets from the image data based on these positions to obtain a plurality of regions of interest;
the image feature extraction neural network module is used for extracting image features of the region of interest to obtain first features;
the text feature extraction neural network module is used for extracting text features of the label information of the target to be identified to obtain second features;
the merging module is used for merging the first features and the second features to obtain a merged matrix;
the cross-modal feature fusion neural network module is used for interactively fusing the first features and the second features in the merged matrix to obtain a fused matrix and separating the fused matrix into two separation matrices;
and the classification module is used for calculating a similarity matrix from the two separation matrices and determining the target detection category from the similarity matrix.
The candidate box neural network module is a DETR target detection network.
The image feature extraction neural network module is a ViT pre-trained model.
The text feature extraction neural network module is a BERT network.
The cross-modal feature fusion neural network module comprises three self-attention layers and two fully connected layers arranged in sequence. The three self-attention layers are used for interactively fusing the first features and the second features in the merged matrix and outputting a fused matrix; the two fully connected layers are used for separating the fused matrix into two separation matrices.
The classification module comprises: a calculation unit, used for performing matrix multiplication on the two separation matrices and applying a Softmax normalization to the result to obtain a similarity matrix; and a comparison unit, used for comparing the label similarities in the similarity matrix with a threshold, removing labels below the threshold, and taking the retained labels as the target detection categories.
A third embodiment of the present invention relates to an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the cross-modal image target detection method using text information of the first embodiment.
A fourth embodiment of the present invention relates to a computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements the steps of the cross-modal image target detection method using text information of the first embodiment.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) having computer-usable program code embodied therein. The schemes in the embodiments of the invention may be implemented in various computer languages, for example the object-oriented programming language Java and the scripting language JavaScript.
The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all changes and modifications that fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. A cross-modal image target detection method using text information, characterized by comprising the following steps:
acquiring image data and label information of a target to be identified;
detecting all targets in the image data with a candidate-box neural network module, determining the positions of all targets, and cropping all targets from the image data based on these positions to obtain a plurality of regions of interest;
extracting image features of the regions of interest with an image feature extraction neural network module to obtain first features;
extracting text features of the label information of the target to be identified with a text feature extraction neural network module to obtain second features;
merging the first features and the second features to obtain a merged matrix;
interactively fusing the first features and the second features in the merged matrix with a cross-modal feature fusion neural network module to obtain a fused matrix, and separating the fused matrix into two separation matrices;
and calculating a similarity matrix from the two separation matrices and determining the target detection category from the similarity matrix.
2. The cross-modal image target detection method using text information of claim 1, wherein the candidate-box neural network module is a DETR target detection network.
3. The cross-modal image target detection method using text information of claim 1, wherein the image feature extraction neural network module is a ViT pre-trained model.
4. The cross-modal image target detection method using text information of claim 1, wherein the text feature extraction neural network module is a BERT network.
5. The cross-modal image target detection method using text information of claim 1, wherein the cross-modal feature fusion neural network module comprises three self-attention layers and two fully connected layers arranged in sequence, the three self-attention layers being used for interactively fusing the first features and the second features in the merged matrix and outputting a fused matrix, and the two fully connected layers being used for separating the fused matrix into two separation matrices.
6. A cross-modal image target detection apparatus using text information, characterized by comprising:
the acquisition module is used for acquiring image data and label information of a target to be identified;
the candidate-box neural network module is used for detecting all targets in the image data, determining the positions of all targets, and cropping all targets from the image data based on these positions to obtain a plurality of regions of interest;
the image feature extraction neural network module is used for extracting image features of the region of interest to obtain first features;
the text feature extraction neural network module is used for extracting text features of the label information of the target to be identified to obtain second features;
the merging module is used for merging the first features and the second features to obtain a merged matrix;
the cross-modal feature fusion neural network module is used for interactively fusing the first features and the second features in the merged matrix to obtain a fused matrix and separating the fused matrix into two separation matrices;
and the classification module is used for calculating a similarity matrix from the two separation matrices and determining the target detection category from the similarity matrix.
7. The cross-modal image target detection apparatus using text information of claim 6, wherein the cross-modal feature fusion neural network module comprises three self-attention layers and two fully connected layers arranged in sequence, the three self-attention layers being used for interactively fusing the first features and the second features in the merged matrix and outputting a fused matrix, and the two fully connected layers being used for separating the fused matrix into two separation matrices.
8. The cross-modal image target detection apparatus using text information of claim 6, wherein the classification module comprises: a calculation unit, used for performing matrix multiplication on the two separation matrices and applying a Softmax normalization to the result to obtain a similarity matrix; and a comparison unit, used for comparing the label similarities in the similarity matrix with a threshold, removing labels below the threshold, and taking the retained labels as the target detection categories.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the cross-modal image target detection method using text information according to any of claims 1-5 when executing the computer program.
10. A computer-readable storage medium on which a computer program is stored, which, when executed by a processor, carries out the steps of the cross-modal image target detection method using text information according to any of claims 1-5.
CN202211445740.7A 2022-11-18 2022-11-18 Cross-modal image target detection method and device by using text information Pending CN115713621A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211445740.7A CN115713621A (en) 2022-11-18 2022-11-18 Cross-modal image target detection method and device by using text information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211445740.7A CN115713621A (en) 2022-11-18 2022-11-18 Cross-modal image target detection method and device by using text information

Publications (1)

Publication Number Publication Date
CN115713621A 2023-02-24

Family

ID=85233759

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211445740.7A Pending CN115713621A (en) 2022-11-18 2022-11-18 Cross-modal image target detection method and device by using text information

Country Status (1)

Country Link
CN (1) CN115713621A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116416247A (en) * 2023-06-08 2023-07-11 常州微亿智造科技有限公司 Pre-training-based defect detection method and device



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: Room 301AB, No. 10, Lane 198, Zhangheng Road, Free Trade Pilot Zone, Pudong New Area, Shanghai, 200120
Applicant after: Shanghai Mido Technology Co.,Ltd.
Address before: Room 301AB, No. 10, Lane 198, Zhangheng Road, Free Trade Pilot Zone, Pudong New Area, Shanghai, 200120
Applicant before: SHANGHAI MDATA INFORMATION TECHNOLOGY Co.,Ltd.