CN117892123A - Multi-mode target detection method and device - Google Patents

Multi-mode target detection method and device

Info

Publication number
CN117892123A
CN117892123A
Authority
CN
China
Prior art keywords
feature
decoding
loss
detection
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311727983.4A
Other languages
Chinese (zh)
Inventor
石雅洁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Xumi Yuntu Space Technology Co Ltd
Original Assignee
Shenzhen Xumi Yuntu Space Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Xumi Yuntu Space Technology Co Ltd filed Critical Shenzhen Xumi Yuntu Space Technology Co Ltd
Priority to CN202311727983.4A priority Critical patent/CN117892123A/en
Publication of CN117892123A publication Critical patent/CN117892123A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The disclosure provides a multi-modal target detection method and device. The method comprises the following steps: inputting each set of training data into a multi-modal target detection model; processing two images and a text description through an image encoding network and a text encoding network, respectively, to obtain a first image feature, a second image feature, and a text feature; processing the first image feature and the text feature through a multi-modal decoding network to obtain a first decoding feature; processing the first decoding feature through a feed-forward network to obtain a first detection result; processing the second image feature and the text feature through a multi-modal encoding network to obtain an encoding feature; determining a second decoding feature through the multi-modal decoding network based on the first decoding feature, the text feature, and the encoding feature; processing the second decoding feature through the feed-forward network to obtain a second detection result; calculating losses based on the first image feature, the second image feature, the text feature, and the first and second detection results; and optimizing the multi-modal target detection model according to the losses.

Description

Multi-mode target detection method and device
Technical Field
The disclosure relates to the technical field of target detection, and in particular to a multi-modal target detection method and device.
Background
In recent years, detection models based on multi-modal encoders have attracted widespread attention in academia and industry owing to their excellent performance on various cross-modal tasks. Such models learn cross-modal features from image-text pairs, so they can learn rich representations and improve the accuracy of detection results. However, the relationships between modalities are usually learned from images taken from a single viewing angle, which easily leads to misjudging small or blurred objects; in complex scenes, learning from single-view images also makes it difficult to learn robust cross-modal representations.
Disclosure of Invention
In view of this, the embodiments of the present disclosure provide a multi-modal target detection method, apparatus, electronic device, and computer-readable storage medium, so as to solve the problems in the prior art that, in multi-modal detection over images and text, learning from images of a single viewing angle easily leads to misjudging small or blurred objects, and robust cross-modal representations are difficult to learn in complex scenes.
In a first aspect of the embodiments of the present disclosure, a multi-modal target detection method is provided, including: constructing a multi-modal target detection model using an image encoding network, a text encoding network, a multi-modal decoding network, and a feed-forward network; acquiring a training data set, wherein the training data set comprises multiple sets of training data, and each set of training data comprises two images of the same object from different perspectives and a text description corresponding to the two images; inputting each set of training data into the multi-modal target detection model: processing the two images and the text description through the image encoding network and the text encoding network, respectively, to obtain a first image feature, a second image feature, and a text feature; processing the first image feature and the text feature through the multi-modal decoding network to obtain a first decoding feature; processing the first decoding feature through the feed-forward network to obtain a first detection result; processing the second image feature and the text feature through the multi-modal encoding network to obtain an encoding feature; determining a second decoding feature through the multi-modal decoding network based on the first decoding feature, the text feature, and the encoding feature; processing the second decoding feature through the feed-forward network to obtain a second detection result; calculating a feature contrast loss based on the first image feature, the second image feature, and the text feature, and separately calculating detection losses corresponding to the first detection result and the second detection result; and optimizing model parameters of the multi-modal target detection model according to the feature contrast loss and the detection losses, so as to complete training of the multi-modal target detection model.
In a second aspect of the embodiments of the present disclosure, a multi-modal object detection apparatus is provided, comprising: a construction module configured to construct a multi-modal target detection model using an image encoding network, a text encoding network, a multi-modal decoding network, and a feed-forward network; an acquisition module configured to acquire a training data set, wherein the training data set comprises multiple sets of training data, and each set of training data comprises two images of the same object from different perspectives and a text description corresponding to the two images; a first processing module configured to input each set of training data into the multi-modal target detection model: processing the two images and the text description through the image encoding network and the text encoding network, respectively, to obtain a first image feature, a second image feature, and a text feature; a second processing module configured to process the first image feature and the text feature through the multi-modal decoding network to obtain a first decoding feature; a third processing module configured to process the first decoding feature through the feed-forward network to obtain a first detection result; a fourth processing module configured to process the second image feature and the text feature through the multi-modal encoding network to obtain an encoding feature; a fifth processing module configured to determine a second decoding feature through the multi-modal decoding network based on the first decoding feature, the text feature, and the encoding feature; a sixth processing module configured to process the second decoding feature through the feed-forward network to obtain a second detection result; a calculation module configured to calculate a feature contrast loss based on the first image feature, the second image feature, and the text feature, and to separately calculate detection losses corresponding to the first detection result and the second detection result; and an optimization module configured to optimize model parameters of the multi-modal target detection model according to the feature contrast loss and the detection losses, so as to complete training of the multi-modal target detection model.
In a third aspect of the disclosed embodiments, an electronic device is provided, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above method when executing the computer program.
In a fourth aspect of the disclosed embodiments, a computer-readable storage medium is provided, which stores a computer program which, when executed by a processor, implements the steps of the above-described method.
Compared with the prior art, the embodiments of the present disclosure have the following beneficial effects: because the disclosed embodiments construct a multi-modal target detection model using an image encoding network, a text encoding network, a multi-modal decoding network, and a feed-forward network; acquire a training data set, wherein the training data set comprises multiple sets of training data, and each set of training data comprises two images of the same object from different perspectives and a text description corresponding to the two images; and input each set of training data into the multi-modal target detection model: processing the two images and the text description through the image encoding network and the text encoding network, respectively, to obtain a first image feature, a second image feature, and a text feature; processing the first image feature and the text feature through the multi-modal decoding network to obtain a first decoding feature; processing the first decoding feature through the feed-forward network to obtain a first detection result; processing the second image feature and the text feature through the multi-modal encoding network to obtain an encoding feature; determining a second decoding feature through the multi-modal decoding network based on the first decoding feature, the text feature, and the encoding feature; processing the second decoding feature through the feed-forward network to obtain a second detection result; calculating a feature contrast loss based on the first image feature, the second image feature, and the text feature, and separately calculating detection losses corresponding to the first detection result and the second detection result; and optimizing model parameters of the multi-modal target detection model according to the feature contrast loss and the detection losses, so as to complete training of the multi-modal target detection model. By adopting these technical means, the problems in the prior art that, in multi-modal detection over images and text, learning from single-view images easily leads to misjudging small or blurred objects, and robust cross-modal representations are difficult to learn in complex scenes, are solved, thereby improving the detection accuracy for small or blurred objects and ensuring that robust cross-modal representations can be learned in complex scenes.
Drawings
In order to illustrate the technical solutions of the embodiments of the present disclosure more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. It is apparent that the drawings in the following description are only some embodiments of the present disclosure; for a person of ordinary skill in the art, other drawings may be obtained from these drawings without inventive effort.
Fig. 1 is a schematic flow chart of a multi-modal target detection method according to an embodiment of the disclosure;
Fig. 2 is a schematic flow chart of another multi-modal target detection method according to an embodiment of the disclosure;
Fig. 3 is a schematic structural diagram of a multi-modal object detection apparatus according to an embodiment of the disclosure;
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system configurations, techniques, etc. in order to provide a thorough understanding of the disclosed embodiments. However, it will be apparent to one skilled in the art that the present disclosure may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present disclosure with unnecessary detail.
A multi-modal object detection method and apparatus according to embodiments of the present disclosure will be described in detail with reference to the accompanying drawings.
Fig. 1 is a flow chart of a multi-modal target detection method according to an embodiment of the disclosure. The multi-modal target detection method of Fig. 1 may be performed by a computer or a server, or by software on a computer or server. As shown in Fig. 1, the multi-modal target detection method includes:
S101, constructing a multi-modal target detection model using an image encoding network, a text encoding network, a multi-modal decoding network, and a feed-forward network;
S102, acquiring a training data set, wherein the training data set comprises multiple sets of training data, and each set of training data comprises two images of the same object from different perspectives and a text description corresponding to the two images;
S103, inputting each set of training data into the multi-modal target detection model: processing the two images and the text description through the image encoding network and the text encoding network, respectively, to obtain a first image feature, a second image feature, and a text feature;
S104, processing the first image feature and the text feature through the multi-modal decoding network to obtain a first decoding feature;
S105, processing the first decoding feature through the feed-forward network to obtain a first detection result;
S106, processing the second image feature and the text feature through the multi-modal encoding network to obtain an encoding feature;
S107, determining a second decoding feature through the multi-modal decoding network based on the first decoding feature, the text feature, and the encoding feature;
S108, processing the second decoding feature through the feed-forward network to obtain a second detection result;
S109, calculating a feature contrast loss based on the first image feature, the second image feature, and the text feature, and separately calculating detection losses corresponding to the first detection result and the second detection result;
S110, optimizing model parameters of the multi-modal target detection model according to the feature contrast loss and the detection losses, so as to complete training of the multi-modal target detection model.
The image encoding network and the text encoding network may employ the towers of a multi-modal pretrained network such as CLIP (Contrastive Language-Image Pre-training), the multi-modal encoding network may employ a Transformer encoder, the multi-modal decoding network may employ a Transformer decoder, and the feed-forward network (FFN) serves as the detection head that produces the detection results. The text description corresponding to the two images is a description of the information in the two images, including information about the objects and the scene in the two images.
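For orientation, the following is a minimal PyTorch sketch of how these four networks might be wired together. All module choices, dimensions, and layer counts are illustrative assumptions: the patent names the component networks but does not fix a concrete architecture.

```python
# Minimal sketch of the model layout. Module choices, dimensions, and
# layer counts are illustrative assumptions, not the patent's design.
import torch.nn as nn

class MultiModalDetector(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_classes=80, vocab_size=30522):
        super().__init__()
        # Stand-ins for the CLIP image and text encoding towers.
        self.image_encoder = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        self.text_encoder = nn.Embedding(vocab_size, d_model)
        # The multi-modal encoding and decoding networks both consume
        # separate Q and K/V inputs (cross-attention), so a Transformer
        # decoder layer stack is used as a stand-in for each.
        self.mm_encoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True),
            num_layers=6)
        self.mm_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True),
            num_layers=6)
        # Feed-forward detection head: class logits and a (cx, cy, w, h) box.
        self.ffn_cls = nn.Linear(d_model, n_classes)
        self.ffn_box = nn.Linear(d_model, 4)
```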
According to the technical solution provided by the embodiments of the present application, a multi-modal target detection model is constructed using an image encoding network, a text encoding network, a multi-modal decoding network, and a feed-forward network; a training data set is acquired, wherein the training data set comprises multiple sets of training data, and each set of training data comprises two images of the same object from different perspectives and a text description corresponding to the two images; each set of training data is input into the multi-modal target detection model: the two images and the text description are processed through the image encoding network and the text encoding network, respectively, to obtain a first image feature, a second image feature, and a text feature; the first image feature and the text feature are processed through the multi-modal decoding network to obtain a first decoding feature; the first decoding feature is processed through the feed-forward network to obtain a first detection result; the second image feature and the text feature are processed through the multi-modal encoding network to obtain an encoding feature; a second decoding feature is determined through the multi-modal decoding network based on the first decoding feature, the text feature, and the encoding feature; the second decoding feature is processed through the feed-forward network to obtain a second detection result; a feature contrast loss is calculated based on the first image feature, the second image feature, and the text feature, and detection losses corresponding to the first detection result and the second detection result are calculated separately; and model parameters of the multi-modal target detection model are optimized according to the feature contrast loss and the detection losses, so as to complete training of the multi-modal target detection model. By adopting these technical means, the problems in the prior art that, in multi-modal detection over images and text, learning from single-view images easily leads to misjudging small or blurred objects, and robust cross-modal representations are difficult to learn in complex scenes, are solved, thereby improving the detection accuracy for small or blurred objects and ensuring that robust cross-modal representations can be learned in complex scenes.
Further, processing the first image feature and the text feature through the multi-modal decoding network to obtain the first decoding feature includes: taking the text feature as a first query vector, and taking the first image feature as both a first key vector and a first value vector; and inputting the first query vector, the first key vector, and the first value vector into the multi-modal decoding network, which outputs the first decoding feature.
It should be noted that the query, key, and value vectors in the present application are, respectively, the Q (Query), K (Key), and V (Value) vectors of the attention mechanism.
The first key vector and the first value vector are identical: both are the first image feature, while the first query vector is the text feature. The first query vector, the first key vector, and the first value vector are input into the multi-modal decoding network as a vector group (Q, K, V), and the network outputs the first decoding feature.
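The following is a hedged sketch of this (Q, K, V) step, using a single nn.MultiheadAttention layer as a stand-in for the full multi-modal decoding network; the batch size, token counts, and feature dimension are assumptions. The same pattern applies to the encoding step described next, with the text feature as Q and the second image feature as K and V.

```python
# Q = text feature, K = V = first image feature, fed through a single
# cross-attention layer as a stand-in for the multi-modal decoding network.
import torch
import torch.nn as nn

d_model, n_heads = 256, 8
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

text_feat = torch.randn(2, 16, d_model)   # (batch, text tokens, dim)
img1_feat = torch.randn(2, 196, d_model)  # (batch, image patches, dim)

first_decoding_feat, _ = cross_attn(query=text_feat, key=img1_feat, value=img1_feat)
print(first_decoding_feat.shape)  # torch.Size([2, 16, 256]) — one row per text token
```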
Further, processing the second image feature and the text feature through the multi-modal encoding network to obtain the encoding feature includes: taking the text feature as a second query vector, and taking the second image feature as both a second key vector and a second value vector; and inputting the second query vector, the second key vector, and the second value vector into the multi-modal encoding network, which outputs the encoding feature.
The second key vector and the second value vector are identical: both are the second image feature, while the second query vector is the text feature. The three are input into the multi-modal encoding network as a vector group (Q, K, V), and the network outputs the encoding feature.
Further, determining the second decoding feature through the multi-modal decoding network based on the first decoding feature, the text feature, and the encoding feature includes: performing a dimension transformation on the first decoding feature, wherein the dimension of the transformed first decoding feature is the same as that of the text feature; performing a matrix summation of the dimension-transformed first decoding feature and the text feature to obtain a summation result; taking the summation result as a third query vector, and taking the encoding feature as both a third key vector and a third value vector; and inputting the third query vector, the third key vector, and the third value vector into the multi-modal decoding network, which outputs the second decoding feature.
That is, the first decoding feature is first transformed to the same dimension as the text feature and then added to the text feature to obtain the summation result. The third key vector and the third value vector are identical: both are the encoding feature, while the third query vector is the summation result. The three are input into the multi-modal decoding network as a vector group (Q, K, V), and the network outputs the second decoding feature.
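A sketch of this second decoding step follows. A linear layer stands in for the dimension transformation, which is an assumption: the patent requires only that the transformed feature match the text feature's dimension.

```python
# Project the first decoding feature to the text feature's dimension,
# add the two (matrix summation), then cross-attend into the encoding
# feature: Q = summed result, K = V = encoding feature.
import torch
import torch.nn as nn

d_model = 256
dim_transform = nn.Linear(d_model, d_model)  # assumed realization of the transform
cross_attn = nn.MultiheadAttention(d_model, 8, batch_first=True)

first_dec = torch.randn(2, 16, d_model)  # first decoding feature
text_feat = torch.randn(2, 16, d_model)  # text feature
enc_feat = torch.randn(2, 196, d_model)  # encoding feature (view 2 + text)

summed = dim_transform(first_dec) + text_feat  # summation result (third query)
second_dec, _ = cross_attn(query=summed, key=enc_feat, value=enc_feat)
```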
Further, calculating the feature contrast loss based on the first image feature, the second image feature, and the text feature includes: calculating a first contrast loss between the first image feature and the second image feature using a contrastive learning loss function; calculating a second contrast loss between the first image feature and the text feature using the contrastive learning loss function; and calculating a third contrast loss between the second image feature and the text feature using the contrastive learning loss function. The feature contrast loss comprises the first, second, and third contrast losses.
The contrastive learning loss function may be the InfoNCE loss. The first, second, and third contrast losses are weighted and summed to obtain the feature contrast loss.
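A sketch of one pairwise term follows, using a standard one-directional InfoNCE formulation in which matching rows of a batch are positives and all other rows are negatives. The pooling of each feature to one vector per sample, the temperature, and the unit weights in the final sum are assumptions.

```python
# One-directional InfoNCE: matching rows of a and b are positives,
# all other rows in the batch are negatives.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    # a, b: (batch, dim) pooled features.
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature               # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)

f1, f2, ft = (torch.randn(8, 256) for _ in range(3))  # pooled features
feature_contrast_loss = info_nce(f1, f2) + info_nce(f1, ft) + info_nce(f2, ft)
```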
Further, calculating the detection loss corresponding to the first detection result includes: the first detection result comprises a first prediction box and a first prediction category, the labels corresponding to the two images comprise a labeling box and a labeling category, and the detection loss corresponding to the first detection result comprises a first detection loss, a second detection loss, and a third detection loss; calculating the first detection loss between the first prediction category and the labeling category using a cross-entropy loss function; calculating the second detection loss between the first prediction box and the labeling box using an L1-norm loss function; and calculating the third detection loss between the first prediction box and the labeling box using a cross-ratio loss function.
The prediction box represents the predicted position of the object in the image, the labeling box represents the actual position of the object in the image, the prediction category represents the predicted category of the object, and the labeling category represents the actual category of the object. For example, if the object is a person, the labeling category is 'person'; if the object is a dog, the labeling category is 'dog'; and if the object is a car, the labeling category is 'car'.
The cross-ratio loss function is based on the intersection over union (IoU). The first, second, and third detection losses are weighted and summed to obtain the detection loss corresponding to the first detection result.
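The following sketch combines the three terms for one detection result (it applies identically to the second detection result described next). The box format (x1, y1, x2, y2), the loss weights, and the assumption that predictions are already matched one-to-one with labeling boxes (DETR-style detectors usually obtain such a matching via Hungarian matching, which the patent does not describe) are all illustrative.

```python
# Three-term detection loss: cross-entropy on classes, L1 on boxes,
# and an IoU-based term, combined by a weighted sum.
import torch
import torch.nn.functional as F
from torchvision.ops import box_iou

def detection_loss(pred_logits, pred_boxes, gt_classes, gt_boxes,
                   w_cls=1.0, w_l1=5.0, w_iou=2.0):
    # pred_logits: (N, n_classes); pred_boxes/gt_boxes: (N, 4) as x1y1x2y2.
    cls_loss = F.cross_entropy(pred_logits, gt_classes)  # first detection loss
    l1_loss = F.l1_loss(pred_boxes, gt_boxes)            # second detection loss
    iou = box_iou(pred_boxes, gt_boxes).diag()           # IoU of matched pairs
    iou_loss = (1.0 - iou).mean()                        # third detection loss
    return w_cls * cls_loss + w_l1 * l1_loss + w_iou * iou_loss
```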
Further, calculating the detection loss corresponding to the second detection result includes: the second detection result comprises a second prediction box and a second prediction category, the labels corresponding to the two images comprise a labeling box and a labeling category, and the detection loss corresponding to the second detection result comprises a fourth detection loss, a fifth detection loss, and a sixth detection loss; calculating the fourth detection loss between the second prediction category and the labeling category using the cross-entropy loss function; calculating the fifth detection loss between the second prediction box and the labeling box using the L1-norm loss function; and calculating the sixth detection loss between the second prediction box and the labeling box using the cross-ratio loss function.
The fourth, fifth, and sixth detection losses are weighted and summed to obtain the detection loss corresponding to the second detection result.
Fig. 2 is a flow chart of another multi-modal target detection method according to an embodiment of the disclosure. As shown in Fig. 2, the method includes:
S201, acquiring two target images of a target object from different perspectives and a target text description corresponding to the two target images, and inputting the two target images and the corresponding target text description into the trained multi-modal target detection model:
S202, processing the two target images and the target text description through the image encoding network and the text encoding network, respectively, to obtain a first target image feature, a second target image feature, and a target text feature;
S203, processing the first target image feature and the target text feature through the multi-modal decoding network to obtain a first target decoding feature;
S204, processing the first target decoding feature through the feed-forward network to obtain a first target detection result;
S205, processing the second target image feature and the target text feature through the multi-modal encoding network to obtain a target encoding feature;
S206, determining a second target decoding feature through the multi-modal decoding network based on the first target decoding feature, the target text feature, and the target encoding feature;
S207, processing the second target decoding feature through the feed-forward network to obtain a second target detection result;
S208, performing a weighted summation of the first target detection result and the second target detection result to obtain a final detection result.
Alternatively, the second target detection result may be used directly as the final detection result. The first target detection result is the detection result for the target object in the image corresponding to the first target image feature, and the second target detection result is the detection result for the target object in the image corresponding to the second target image feature. The two target images are processed through the image encoding network to obtain the first target image feature and the second target image feature.
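A minimal sketch of the fusion step in S208 follows; the equal weights are an assumption, and, as noted, the second target detection result may instead be used alone.

```python
# Weighted fusion of the two detection results; alpha = 0.5 is an assumption.
def fuse_results(result1, result2, alpha=0.5):
    # Each result: dict of tensors (e.g. class logits and boxes) of equal shapes.
    return {k: alpha * result1[k] + (1 - alpha) * result2[k] for k in result1}
```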
Any combination of the above optional solutions may be adopted to form an optional embodiment of the present application, which is not described herein in detail.
The following are device embodiments of the present disclosure that may be used to perform method embodiments of the present disclosure. For details not disclosed in the embodiments of the apparatus of the present disclosure, please refer to the embodiments of the method of the present disclosure.
Fig. 3 is a schematic diagram of a multi-modal object detection apparatus according to an embodiment of the disclosure. As shown in Fig. 3, the multi-modal object detection apparatus includes:
a construction module 301 configured to construct a multi-modal target detection model using an image encoding network, a text encoding network, a multi-modal decoding network, and a feed-forward network;
an acquisition module 302 configured to acquire a training data set, wherein the training data set includes multiple sets of training data, each set of training data including two images of the same object from different perspectives and a text description corresponding to the two images;
a first processing module 303 configured to input each set of training data into the multi-modal target detection model: processing the two images and the text description through the image encoding network and the text encoding network, respectively, to obtain a first image feature, a second image feature, and a text feature;
a second processing module 304 configured to process the first image feature and the text feature through the multi-modal decoding network to obtain a first decoding feature;
a third processing module 305 configured to process the first decoding feature through the feed-forward network to obtain a first detection result;
a fourth processing module 306 configured to process the second image feature and the text feature through the multi-modal encoding network to obtain an encoding feature;
a fifth processing module 307 configured to determine a second decoding feature through the multi-modal decoding network based on the first decoding feature, the text feature, and the encoding feature;
a sixth processing module 308 configured to process the second decoding feature through the feed-forward network to obtain a second detection result;
a calculation module 309 configured to calculate a feature contrast loss based on the first image feature, the second image feature, and the text feature, and to separately calculate detection losses corresponding to the first detection result and the second detection result;
an optimization module 310 configured to optimize model parameters of the multi-modal target detection model according to the feature contrast loss and the detection losses, so as to complete training of the multi-modal target detection model.
According to the technical solution provided by the embodiments of the present application, a multi-modal target detection model is constructed using an image encoding network, a text encoding network, a multi-modal decoding network, and a feed-forward network; a training data set is acquired, wherein the training data set comprises multiple sets of training data, and each set of training data comprises two images of the same object from different perspectives and a text description corresponding to the two images; each set of training data is input into the multi-modal target detection model: the two images and the text description are processed through the image encoding network and the text encoding network, respectively, to obtain a first image feature, a second image feature, and a text feature; the first image feature and the text feature are processed through the multi-modal decoding network to obtain a first decoding feature; the first decoding feature is processed through the feed-forward network to obtain a first detection result; the second image feature and the text feature are processed through the multi-modal encoding network to obtain an encoding feature; a second decoding feature is determined through the multi-modal decoding network based on the first decoding feature, the text feature, and the encoding feature; the second decoding feature is processed through the feed-forward network to obtain a second detection result; a feature contrast loss is calculated based on the first image feature, the second image feature, and the text feature, and detection losses corresponding to the first detection result and the second detection result are calculated separately; and model parameters of the multi-modal target detection model are optimized according to the feature contrast loss and the detection losses, so as to complete training of the multi-modal target detection model. By adopting these technical means, the problems in the prior art that, in multi-modal detection over images and text, learning from single-view images easily leads to misjudging small or blurred objects, and robust cross-modal representations are difficult to learn in complex scenes, are solved, thereby improving the detection accuracy for small or blurred objects and ensuring that robust cross-modal representations can be learned in complex scenes.
In some embodiments, the second processing module 304 is further configured to take the text feature as a first query vector and the first image feature as both a first key vector and a first value vector; and to input the first query vector, the first key vector, and the first value vector into the multi-modal decoding network, outputting the first decoding feature.
In some embodiments, the fourth processing module 306 is further configured to take the text feature as a second query vector and the second image feature as both a second key vector and a second value vector; and to input the second query vector, the second key vector, and the second value vector into the multi-modal encoding network, outputting the encoding feature.
In some embodiments, the fifth processing module 307 is further configured to perform a dimension transformation on the first decoding feature, wherein the transformed first decoding feature has the same dimension as the text feature; perform a matrix summation of the dimension-transformed first decoding feature and the text feature to obtain a summation result; take the summation result as a third query vector and the encoding feature as both a third key vector and a third value vector; and input the third query vector, the third key vector, and the third value vector into the multi-modal decoding network, outputting the second decoding feature.
In some embodiments, the calculation module 309 is further configured to calculate a first contrast loss between the first image feature and the second image feature using a contrastive learning loss function; calculate a second contrast loss between the first image feature and the text feature using the contrastive learning loss function; and calculate a third contrast loss between the second image feature and the text feature using the contrastive learning loss function. The feature contrast loss comprises the first, second, and third contrast losses.
In some embodiments, the calculation module 309 is further configured such that the first detection result comprises a first prediction box and a first prediction category, the labels corresponding to the two images comprise a labeling box and a labeling category, and the detection loss corresponding to the first detection result comprises a first detection loss, a second detection loss, and a third detection loss; the module calculates the first detection loss between the first prediction category and the labeling category using a cross-entropy loss function, the second detection loss between the first prediction box and the labeling box using an L1-norm loss function, and the third detection loss between the first prediction box and the labeling box using a cross-ratio loss function.
In some embodiments, the calculation module 309 is further configured such that the second detection result comprises a second prediction box and a second prediction category, the labels corresponding to the two images comprise a labeling box and a labeling category, and the detection loss corresponding to the second detection result comprises a fourth detection loss, a fifth detection loss, and a sixth detection loss; the module calculates the fourth detection loss between the second prediction category and the labeling category using the cross-entropy loss function, the fifth detection loss between the second prediction box and the labeling box using the L1-norm loss function, and the sixth detection loss between the second prediction box and the labeling box using the cross-ratio loss function.
In some embodiments, the optimization module 310 is further configured to acquire two target images of a target object from different perspectives and a target text description corresponding to the two target images, input the two target images and the corresponding target text description into the trained multi-modal target detection model, and process the two target images and the target text description through the image encoding network and the text encoding network, respectively, to obtain a first target image feature, a second target image feature, and a target text feature; process the first target image feature and the target text feature through the multi-modal decoding network to obtain a first target decoding feature; process the first target decoding feature through the feed-forward network to obtain a first target detection result; process the second target image feature and the target text feature through the multi-modal encoding network to obtain a target encoding feature; determine a second target decoding feature through the multi-modal decoding network based on the first target decoding feature, the target text feature, and the target encoding feature; process the second target decoding feature through the feed-forward network to obtain a second target detection result; and perform a weighted summation of the first target detection result and the second target detection result to obtain a final detection result.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and shall not constitute any limitation on the implementation process of the embodiments of the disclosure.
Fig. 4 is a schematic diagram of an electronic device 4 provided by an embodiment of the present disclosure. As shown in fig. 4, the electronic apparatus 4 of this embodiment includes: a processor 401, a memory 402 and a computer program 403 stored in the memory 402 and executable on the processor 401. The steps of the various method embodiments described above are implemented by processor 401 when executing computer program 403. Alternatively, the processor 401, when executing the computer program 403, performs the functions of the modules/units in the above-described apparatus embodiments.
The electronic device 4 may be a desktop computer, a notebook computer, a palm computer, a cloud server, or the like. The electronic device 4 may include, but is not limited to, a processor 401 and a memory 402. It will be appreciated by those skilled in the art that fig. 4 is merely an example of the electronic device 4 and is not limiting of the electronic device 4 and may include more or fewer components than shown, or different components.
The processor 401 may be a central processing unit (Central Processing Unit, CPU) or other general purpose processor, digital signal processor (Digital Signal Processor, DSP), application specific integrated circuit (Application Specific Integrated Circuit, ASIC), field programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like.
The memory 402 may be an internal storage unit of the electronic device 4, for example, a hard disk or a memory of the electronic device 4. The memory 402 may also be an external storage device of the electronic device 4, for example, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, or the like provided on the electronic device 4. The memory 402 may also include both an internal storage unit and an external storage device of the electronic device 4. The memory 402 is used to store computer programs and other programs and data required by the electronic device.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the present disclosure may implement all or part of the flow of the methods of the above embodiments by instructing related hardware through a computer program; the computer program may be stored in a computer-readable storage medium and, when executed by a processor, may implement the steps of the method embodiments described above. The computer program may comprise computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and so forth. It should be noted that the content of the computer-readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, the computer-readable medium does not include electrical carrier signals and telecommunication signals.
The above embodiments are only intended to illustrate the technical solutions of the present disclosure, not to limit them. Although the present disclosure has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the disclosure, and are intended to be included within the scope of the present disclosure.

Claims (10)

1. A multi-modal object detection method, comprising:
constructing a multi-modal target detection model using an image encoding network, a text encoding network, a multi-modal decoding network, and a feed-forward network;
acquiring a training data set, wherein the training data set comprises a plurality of sets of training data, and each set of training data comprises two images of the same object from different perspectives and a text description corresponding to the two images;
inputting each set of training data into the multi-modal target detection model:
processing the two images and the text description through the image encoding network and the text encoding network, respectively, to obtain a first image feature, a second image feature, and a text feature;
processing the first image feature and the text feature through the multi-modal decoding network to obtain a first decoding feature;
processing the first decoding feature through the feed-forward network to obtain a first detection result;
processing the second image feature and the text feature through the multi-modal encoding network to obtain an encoding feature;
determining, by the multi-modal decoding network, a second decoding feature based on the first decoding feature, the text feature, and the encoding feature;
processing the second decoding feature through the feed-forward network to obtain a second detection result;
calculating a feature contrast loss based on the first image feature, the second image feature, and the text feature, and separately calculating detection losses corresponding to the first detection result and the second detection result; and
optimizing model parameters of the multi-modal target detection model according to the feature contrast loss and the detection losses, so as to complete training of the multi-modal target detection model.
2. The method of claim 1, wherein processing the first image feature and the text feature through the multi-modal decoding network to obtain the first decoding feature comprises:
taking the text feature as a first query vector, and taking the first image feature as a first key vector and a first value vector; and
inputting the first query vector, the first key vector, and the first value vector into the multi-modal decoding network, and outputting the first decoding feature.
3. The method of claim 1, wherein processing the second image feature and the text feature through the multi-modal encoding network to obtain the encoding feature comprises:
taking the text feature as a second query vector, and taking the second image feature as a second key vector and a second value vector; and
inputting the second query vector, the second key vector, and the second value vector into the multi-modal encoding network, and outputting the encoding feature.
4. The method of claim 1, wherein determining, by the multi-modal decoding network, the second decoding feature based on the first decoding feature, the text feature, and the encoding feature comprises:
performing a dimension transformation on the first decoding feature, wherein the dimension of the transformed first decoding feature is the same as that of the text feature;
performing a matrix summation of the dimension-transformed first decoding feature and the text feature to obtain a summation result;
taking the summation result as a third query vector, and taking the encoding feature as a third key vector and a third value vector; and
inputting the third query vector, the third key vector, and the third value vector into the multi-modal decoding network, and outputting the second decoding feature.
5. The method of claim 1, wherein calculating the feature contrast loss based on the first image feature, the second image feature, and the text feature comprises:
calculating a first contrast loss between the first image feature and the second image feature using a contrastive learning loss function;
calculating a second contrast loss between the first image feature and the text feature using the contrastive learning loss function; and
calculating a third contrast loss between the second image feature and the text feature using the contrastive learning loss function;
wherein the feature contrast loss comprises: the first contrast loss, the second contrast loss, and the third contrast loss.
6. The method of claim 1, wherein calculating the detection loss corresponding to the first detection result comprises:
the first detection result comprises a first prediction box and a first prediction category, the labels corresponding to the two images comprise a labeling box and a labeling category, and the detection loss corresponding to the first detection result comprises a first detection loss, a second detection loss, and a third detection loss;
calculating the first detection loss between the first prediction category and the labeling category using a cross-entropy loss function;
calculating the second detection loss between the first prediction box and the labeling box using an L1-norm loss function; and
calculating the third detection loss between the first prediction box and the labeling box using a cross-ratio loss function.
7. The method of claim 1, wherein calculating the detection loss corresponding to the second detection result comprises:
the second detection result comprises a second prediction box and a second prediction category, the labels corresponding to the two images comprise a labeling box and a labeling category, and the detection loss corresponding to the second detection result comprises a fourth detection loss, a fifth detection loss, and a sixth detection loss;
calculating the fourth detection loss between the second prediction category and the labeling category using the cross-entropy loss function;
calculating the fifth detection loss between the second prediction box and the labeling box using the L1-norm loss function; and
calculating the sixth detection loss between the second prediction box and the labeling box using the cross-ratio loss function.
8. A multi-modal object detection apparatus, comprising:
a construction module configured to construct a multi-modal target detection model using an image encoding network, a text encoding network, a multi-modal decoding network, and a feed-forward network;
an acquisition module configured to acquire a training data set, wherein the training data set comprises a plurality of sets of training data, and each set of training data comprises two images of the same object from different perspectives and a text description corresponding to the two images;
a first processing module configured to input each set of training data into the multi-modal target detection model: processing the two images and the text description through the image encoding network and the text encoding network, respectively, to obtain a first image feature, a second image feature, and a text feature;
a second processing module configured to process the first image feature and the text feature through the multi-modal decoding network to obtain a first decoding feature;
a third processing module configured to process the first decoding feature through the feed-forward network to obtain a first detection result;
a fourth processing module configured to process the second image feature and the text feature through the multi-modal encoding network to obtain an encoding feature;
a fifth processing module configured to determine a second decoding feature through the multi-modal decoding network based on the first decoding feature, the text feature, and the encoding feature;
a sixth processing module configured to process the second decoding feature through the feed-forward network to obtain a second detection result;
a calculation module configured to calculate a feature contrast loss based on the first image feature, the second image feature, and the text feature, and to separately calculate detection losses corresponding to the first detection result and the second detection result; and
an optimization module configured to optimize model parameters of the multi-modal target detection model according to the feature contrast loss and the detection losses, so as to complete training of the multi-modal target detection model.
9. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 7 when the computer program is executed.
10. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any one of claims 1 to 7.
CN202311727983.4A 2023-12-14 2023-12-14 Multi-mode target detection method and device Pending CN117892123A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311727983.4A CN117892123A (en) 2023-12-14 2023-12-14 Multi-mode target detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311727983.4A CN117892123A (en) 2023-12-14 2023-12-14 Multi-mode target detection method and device

Publications (1)

Publication Number Publication Date
CN117892123A 2024-04-16

Family

ID=90651580

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311727983.4A Pending CN117892123A (en) 2023-12-14 2023-12-14 Multi-mode target detection method and device

Country Status (1)

Country Link
CN (1) CN117892123A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination