CN115713621A - Cross-modal image target detection method and device by using text information - Google Patents

Cross-modal image target detection method and device by using text information

Info

Publication number
CN115713621A
Authority
CN
China
Prior art keywords
matrix, neural network, network module, cross, target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211445740.7A
Other languages
Chinese (zh)
Inventor
孔欧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Mdata Information Technology Co ltd
Original Assignee
Shanghai Mdata Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Mdata Information Technology Co ltd filed Critical Shanghai Mdata Information Technology Co ltd
Priority to CN202211445740.7A priority Critical patent/CN115713621A/en
Publication of CN115713621A publication Critical patent/CN115713621A/en
Pending legal-status Critical Current

Abstract

The invention relates to a method and a device for cross-modal image target detection using text information, wherein the method comprises the following steps: acquiring image data and label information of a target to be identified; detecting all targets in the image data with a candidate-box neural network module and cropping them to obtain a plurality of regions of interest; extracting image features of the regions of interest with an image feature extraction neural network module to obtain first features; extracting text features of the label information of the target to be identified with a text feature extraction neural network module to obtain second features; merging the first features and the second features to obtain a merged matrix; interactively fusing the features in the merged matrix with a cross-modal feature fusion neural network module and separating the fused matrix into two separation matrices; and calculating a similarity matrix from the two separation matrices and determining the target detection category from the similarity matrix. The invention makes up for the lack of open-vocabulary detection capability.

Description

Cross-modal image target detection method and device by using text information
Technical Field
The invention relates to the technical field of target detection, and in particular to a method and a device for cross-modal image target detection using text information.
Background
General target detection methods have two limitations: 1. they use only the information of the image modality and cannot effectively use text information to enrich the semantic information of the image; 2. if a model is trained on only 10 classes, the targets it can detect at inference time are limited to those 10 classes.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a method and a device for cross-modal image target detection using text information, making up for the lack of open-vocabulary detection capability.
The technical scheme adopted by the invention for solving the technical problem is as follows: the method for detecting the cross-modal image target by using the text information comprises the following steps:
acquiring image data and label information of a target to be identified;
detecting all targets in the image data with a candidate-box neural network module, determining the positions of all targets, and cropping all targets from the image data based on these positions to obtain a plurality of regions of interest;
extracting image features of the regions of interest with an image feature extraction neural network module to obtain first features;
extracting text features of the label information of the target to be identified with a text feature extraction neural network module to obtain second features;
merging the first features and the second features to obtain a merged matrix;
interactively fusing the first features and the second features in the merged matrix with a cross-modal feature fusion neural network module to obtain a fused matrix, and separating the fused matrix into two separation matrices;
and calculating a similarity matrix from the two separation matrices and determining the target detection category from the similarity matrix.
The candidate box neural network module is a DETR target detection network.
The image feature extraction neural network module is a ViT pre-trained model.
The text feature extraction neural network module is a BERT network.
The cross-modal feature fusion neural network module comprises three self-attention layers and two fully connected layers arranged in sequence. The three self-attention layers are used for interactively fusing the first features and the second features in the merged matrix and outputting a fused matrix; the two fully connected layers are used for separating the fused matrix into two separation matrices.
The technical scheme adopted by the invention for solving the technical problems is as follows: provided is a cross-modal image object detection apparatus using text information, including:
the acquisition module is used for acquiring image data and label information of a target to be identified;
the candidate-box neural network module is used for detecting all targets in the image data, determining the positions of all targets, and cropping all targets from the image data based on these positions to obtain a plurality of regions of interest;
the image feature extraction neural network module is used for extracting image features of the region of interest to obtain first features;
the text feature extraction neural network module is used for extracting text features of the label information of the target to be identified to obtain second features;
the merging module is used for merging the first features and the second features to obtain a merged matrix;
the cross-modal feature fusion neural network module is used for interactively fusing the first features and the second features in the merged matrix to obtain a fused matrix and separating the fused matrix into two separation matrices;
and the classification module is used for calculating a similarity matrix from the two separation matrices and determining the target detection category from the similarity matrix.
The cross-modal feature fusion neural network module comprises three self-attention layers and two fully connected layers arranged in sequence. The three self-attention layers are used for interactively fusing the first features and the second features in the merged matrix and outputting a fused matrix; the two fully connected layers are used for separating the fused matrix into two separation matrices.
The classification module comprises: a calculation unit, used for performing matrix multiplication on the two separation matrices and applying a Softmax normalization to the result to obtain a similarity matrix; and a comparison unit, used for comparing the label similarities in the similarity matrix with a threshold, removing labels below the threshold, and taking the retained labels as the target detection categories.
The technical scheme adopted by the invention for solving the technical problems is as follows: there is provided an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the cross-modal image object detection method using text information when executing the computer program.
The technical scheme adopted by the invention for solving the technical problem is as follows: there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the above-described cross-modality image object detection method using text information.
Advantageous effects
Due to the adoption of the above technical scheme, compared with the prior art, the invention has the following advantages and positive effects: through the combined action of the candidate-box neural network module, the image feature extraction neural network module, the text feature extraction neural network module, and the cross-modal feature fusion neural network module, the invention enriches the semantic information of the image features, can effectively detect targets of arbitrary categories without additional category training, and makes up for the lack of open-vocabulary detection capability.
Drawings
FIG. 1 is a flow chart of a first embodiment of the present invention;
FIG. 2 is a block diagram showing the structure of a second embodiment of the present invention.
Detailed Description
The invention will be further illustrated with reference to the following specific examples. It should be understood that these examples are for illustration only and are not intended to limit the scope of the invention. Furthermore, it should be understood that, after reading the teaching of the invention, those skilled in the art may make various changes or modifications to the invention, and such equivalents likewise fall within the scope defined by the appended claims.
A first embodiment of the present invention relates to a cross-modal image target detection method using text information which, as shown in FIG. 1, comprises the following steps:
step 1, acquiring image data and label information of a target to be identified;
and 2, detecting all targets in the image data by adopting a candidate frame neural network module, determining the positions of all the targets, and intercepting all the targets from the image data based on the positions to obtain a plurality of interested areas.
In this step, the candidate-box neural network module uses a DETR target detection network, which can locate M targets in the image data and output M pieces of coordinate information, each being a rectangular box given by the upper-left (x, y) and lower-right (x, y) coordinates of a target. The target areas are cropped from the original input picture using the M pieces of coordinate information, yielding the regions of interest.
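As a minimal sketch of this cropping step (the detector itself is abstracted away behind the hypothetical `boxes` input, and the boxes are assumed to be absolute pixel coordinates in the (x1, y1, x2, y2) order described above):

```python
# Crop one region of interest per box. `boxes` is assumed to hold the
# (x1, y1, x2, y2) rectangles output by the candidate-box network.
from PIL import Image

def crop_regions_of_interest(image: Image.Image, boxes):
    rois = []
    for x1, y1, x2, y2 in boxes:
        rois.append(image.crop((int(x1), int(y1), int(x2), int(y2))))
    return rois

# Example with two hypothetical boxes:
# image = Image.open("input.jpg")
# rois = crop_regions_of_interest(image, [(10, 20, 110, 220), (150, 40, 300, 260)])
```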
Step 3: extract image features of the regions of interest with the image feature extraction neural network module to obtain the first features.
In this step, the image feature extraction neural network module uses a ViT pre-trained model, which extracts a feature from each region of interest, so M first features are obtained. Each first feature is a 768-dimensional vector, and together the M first features form a matrix named A.
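A sketch of this image branch follows. The patent names a ViT pre-trained model but no specific checkpoint, so the Hugging Face checkpoint "google/vit-base-patch16-224-in21k" and the use of each region's [CLS] token as its feature are assumptions here:

```python
# Extract one 768-dim feature per ROI with a pre-trained ViT.
# Checkpoint choice and [CLS] pooling are assumptions, not from the patent.
import torch
from transformers import ViTImageProcessor, ViTModel

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

def extract_first_features(rois):
    """Return matrix A of shape (M, 768), one row per region of interest."""
    inputs = processor(images=rois, return_tensors="pt")
    with torch.no_grad():
        outputs = vit(**inputs)
    return outputs.last_hidden_state[:, 0]  # [CLS] embedding of each ROI
```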
Step 4: extract text features of the label information of the target to be identified with the text feature extraction neural network module to obtain the second features.
In this step, the text feature extraction neural network module adopts a BERT network. Since this module extracts text features, the label information of the target to be identified must first be combined into a sentence. For example, if the label categories of the target to be identified are bird, duck, and automobile, these categories are joined into a sentence, tokenized, and input into the text feature extraction neural network module, which extracts the features of the input text. The resulting N second features are 768-dimensional vectors and together form a matrix named B.
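A sketch of this text branch follows. The patent joins the labels into one sentence before tokenization; encoding each label string separately and taking its [CLS] vector is a simplification adopted here, and the "bert-base-chinese" checkpoint is an assumption:

```python
# Extract one 768-dim feature per label with BERT.
# Per-label encoding and the checkpoint are assumptions, not from the patent.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

def extract_second_features(labels):
    """Return matrix B of shape (N, 768), one row per label string."""
    inputs = tokenizer(labels, padding=True, return_tensors="pt")
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state[:, 0]  # [CLS] embedding of each label

# B = extract_second_features(["bird", "duck", "automobile"])
```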
Step 5: merge the first features and the second features to obtain a merged matrix; that is, concatenating A, of shape (M, 768), with B, of shape (N, 768), row-wise yields a merged matrix of shape (M + N, 768).
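In code, this merge is a single row-wise concatenation:

```python
# Concatenate A (M, 768) and B (N, 768) into the (M + N, 768) merged matrix.
import torch

def merge_features(A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    return torch.cat([A, B], dim=0)
```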
Step 6: interactively fuse the first features and the second features in the merged matrix with the cross-modal feature fusion neural network module to obtain a fused matrix, and separate the fused matrix into two separation matrices.
In this step, the cross-modal feature fusion neural network module comprises three self-attention layers and two fully connected layers arranged in sequence. The three self-attention layers interactively fuse the first features and the second features in the merged matrix and output a fused matrix; that is, the image features and the text features interact through the self-attention mechanism, and the fused matrix, of shape (M + N, 768), is named C. The two fully connected layers separate the fused matrix into two separation matrices: C is passed through the fully connected layers to obtain an (M + N, 512) matrix, which is split into an (M, 512) separation matrix and an (N, 512) separation matrix, named D and E respectively.
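A minimal PyTorch sketch of this module follows. The patent fixes only the layer counts and the 768/512 widths; the head count, the hidden width of the fully connected stack, and the use of plain nn.MultiheadAttention layers are assumptions:

```python
# Three self-attention layers over the merged (M + N, 768) matrix, two fully
# connected layers mapping 768 -> 512, then a row-wise split into D and E.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=768, out_dim=512, heads=8):  # head count is an assumption
        super().__init__()
        self.attn_layers = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(3)
        )
        self.fc = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, out_dim))

    def forward(self, merged: torch.Tensor, num_regions: int):
        x = merged.unsqueeze(0)                  # (1, M + N, 768)
        for attn in self.attn_layers:            # image and text features interact
            x, _ = attn(x, x, x)
        x = self.fc(x).squeeze(0)                # (M + N, 512)
        return x[:num_regions], x[num_regions:]  # D: (M, 512), E: (N, 512)
```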
Step 7: calculate the similarity matrix of the two separation matrices and determine the target detection category from the similarity matrix. Specifically, the separation matrix D is matrix-multiplied with the separation matrix E, and a Softmax normalization is then applied to obtain a similarity matrix of shape (M, N), which represents the similarities between the M image regions and the N labels. Assuming a threshold of 0.5, category labels whose similarity is below 0.5 are filtered out, labels with similarity greater than or equal to 0.5 are retained, and the retained labels are taken as the final target detection categories.
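A sketch of this scoring step; applying the Softmax row-wise over the N labels is an assumption consistent with the statement that the matrix holds the similarities between M image regions and N labels:

```python
# Score every (region, label) pair, then keep labels at or above the threshold.
import torch

def classify(D: torch.Tensor, E: torch.Tensor, labels, threshold=0.5):
    sim = torch.softmax(D @ E.T, dim=-1)  # (M, N) similarity matrix
    results = []
    for row in sim:                        # one row per image region
        kept = [(labels[j], row[j].item())
                for j in range(len(labels)) if row[j] >= threshold]
        results.append(kept)               # retained labels = detected categories
    return results
```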
Through the combined action of the candidate-box neural network module, the image feature extraction neural network module, the text feature extraction neural network module, and the cross-modal feature fusion neural network module, the invention enriches the semantic information of the image features, can effectively detect targets of arbitrary categories without additional category training, and makes up for the lack of open-vocabulary detection capability.
A second embodiment of the present invention relates to a cross-modal image object detection apparatus using text information, as shown in fig. 2, including:
the acquisition module is used for acquiring image data and label information of a target to be identified;
the candidate-box neural network module is used for detecting all targets in the image data, determining the positions of all targets, and cropping all targets from the image data based on these positions to obtain a plurality of regions of interest;
the image feature extraction neural network module is used for extracting image features of the region of interest to obtain first features;
the text feature extraction neural network module is used for extracting text features of the label information of the target to be identified to obtain second features;
the merging module is used for merging the first features and the second features to obtain a merged matrix;
the cross-modal feature fusion neural network module is used for interactively fusing the first features and the second features in the merged matrix to obtain a fused matrix and separating the fused matrix into two separation matrices;
and the classification module is used for calculating a similarity matrix from the two separation matrices and determining the target detection category from the similarity matrix.
The candidate box neural network module is a DETR target detection network.
The image feature extraction neural network module is a ViT pre-trained model.
The text feature extraction neural network module is a BERT network.
The cross-modal feature fusion neural network module comprises three self-attention layers and two fully connected layers arranged in sequence. The three self-attention layers are used for interactively fusing the first features and the second features in the merged matrix and outputting a fused matrix; the two fully connected layers are used for separating the fused matrix into two separation matrices.
The classification module comprises: a calculation unit, used for performing matrix multiplication on the two separation matrices and applying a Softmax normalization to the result to obtain a similarity matrix; and a comparison unit, used for comparing the label similarities in the similarity matrix with a threshold, removing labels below the threshold, and taking the retained labels as the target detection categories.
A third embodiment of the present invention relates to an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the cross-modal image target detection method using text information of the first embodiment.
A fourth embodiment of the present invention relates to a computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements the steps of the cross-modal image target detection method using text information of the first embodiment.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) having computer-usable program code embodied therein. The schemes in the embodiments of the invention may be implemented in various computer languages, for example the object-oriented programming language Java and the scripting language JavaScript.
The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all changes and modifications that fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. A cross-modal image target detection method using text information, characterized by comprising the following steps:
acquiring image data and label information of a target to be identified;
detecting all targets in the image data with a candidate-box neural network module, determining the positions of all targets, and cropping all targets from the image data based on these positions to obtain a plurality of regions of interest;
extracting image features of the regions of interest with an image feature extraction neural network module to obtain first features;
extracting text features of the label information of the target to be identified with a text feature extraction neural network module to obtain second features;
merging the first features and the second features to obtain a merged matrix;
interactively fusing the first features and the second features in the merged matrix with a cross-modal feature fusion neural network module to obtain a fused matrix, and separating the fused matrix into two separation matrices;
and calculating a similarity matrix from the two separation matrices and determining the target detection category from the similarity matrix.
2. The cross-modal image target detection method using text information of claim 1, wherein the candidate-box neural network module is a DETR target detection network.
3. The cross-modal image target detection method using text information of claim 1, wherein the image feature extraction neural network module is a ViT pre-trained model.
4. The cross-modal image target detection method using text information of claim 1, wherein the text feature extraction neural network module is a BERT network.
5. The cross-modal image target detection method using text information of claim 1, wherein the cross-modal feature fusion neural network module comprises three self-attention layers and two fully connected layers arranged in sequence, the three self-attention layers being used for interactively fusing the first features and the second features in the merged matrix and outputting a fused matrix, and the two fully connected layers being used for separating the fused matrix into two separation matrices.
6. A cross-modal image target detection apparatus using text information, characterized by comprising:
the acquisition module is used for acquiring image data and label information of a target to be identified;
the candidate-box neural network module is used for detecting all targets in the image data, determining the positions of all targets, and cropping all targets from the image data based on these positions to obtain a plurality of regions of interest;
the image feature extraction neural network module is used for extracting image features of the region of interest to obtain first features;
the text feature extraction neural network module is used for extracting text features of the label information of the target to be identified to obtain second features;
the merging module is used for merging the first features and the second features to obtain a merged matrix;
the cross-modal feature fusion neural network module is used for interactively fusing the first features and the second features in the merged matrix to obtain a fused matrix and separating the fused matrix into two separation matrices;
and the classification module is used for calculating a similarity matrix from the two separation matrices and determining the target detection category from the similarity matrix.
7. The cross-modal image target detection apparatus using text information of claim 6, wherein the cross-modal feature fusion neural network module comprises three self-attention layers and two fully connected layers arranged in sequence, the three self-attention layers being used for interactively fusing the first features and the second features in the merged matrix and outputting a fused matrix, and the two fully connected layers being used for separating the fused matrix into two separation matrices.
8. The cross-modal image target detection apparatus using text information of claim 6, wherein the classification module comprises: a calculation unit, used for performing matrix multiplication on the two separation matrices and applying a Softmax normalization to the result to obtain a similarity matrix; and a comparison unit, used for comparing the label similarities in the similarity matrix with a threshold, removing labels below the threshold, and taking the retained labels as the target detection categories.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the cross-modal image target detection method using text information according to any of claims 1-5 when executing the computer program.
10. A computer-readable storage medium on which a computer program is stored, which, when executed by a processor, carries out the steps of the cross-modal image target detection method using text information according to any of claims 1-5.
CN202211445740.7A 2022-11-18 2022-11-18 Cross-modal image target detection method and device by using text information Pending CN115713621A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211445740.7A CN115713621A (en) 2022-11-18 2022-11-18 Cross-modal image target detection method and device by using text information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211445740.7A CN115713621A (en) 2022-11-18 2022-11-18 Cross-modal image target detection method and device by using text information

Publications (1)

Publication Number Publication Date
CN115713621A 2023-02-24

Family

ID=85233759

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211445740.7A Pending CN115713621A (en) 2022-11-18 2022-11-18 Cross-modal image target detection method and device by using text information

Country Status (1)

Country Link
CN (1) CN115713621A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116416247A (en) * 2023-06-08 2023-07-11 常州微亿智造科技有限公司 Pre-training-based defect detection method and device



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: Room 301AB, No. 10, Lane 198, Zhangheng Road, Free Trade Pilot Zone, Pudong New Area, Shanghai, 200120
Applicant after: Shanghai Mido Technology Co.,Ltd.
Address before: Room 301AB, No. 10, Lane 198, Zhangheng Road, Free Trade Pilot Zone, Pudong New Area, Shanghai, 200120
Applicant before: SHANGHAI MDATA INFORMATION TECHNOLOGY Co.,Ltd.