CN117892123A - Multi-mode target detection method and device - Google Patents

Multi-mode target detection method and device

Info

Publication number
CN117892123A
CN117892123A
Authority
CN
China
Prior art keywords
feature
decoding
loss
detection
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311727983.4A
Other languages
Chinese (zh)
Inventor
石雅洁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Xumi Yuntu Space Technology Co Ltd
Original Assignee
Shenzhen Xumi Yuntu Space Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Xumi Yuntu Space Technology Co Ltd filed Critical Shenzhen Xumi Yuntu Space Technology Co Ltd
Priority to CN202311727983.4A priority Critical patent/CN117892123A/en
Publication of CN117892123A publication Critical patent/CN117892123A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The disclosure provides a multi-modal target detection method and device. The method comprises the following steps: inputting each set of training data into a multi-modal target detection model; processing two images and a text description through an image encoding network and a text encoding network, respectively, to obtain a first image feature, a second image feature, and a text feature; processing the first image feature and the text feature through a multi-modal decoding network to obtain a first decoding feature; processing the first decoding feature through a feed-forward network to obtain a first detection result; processing the second image feature and the text feature through a multi-modal encoding network to obtain an encoding feature; determining a second decoding feature through the multi-modal decoding network based on the first decoding feature, the text feature, and the encoding feature; processing the second decoding feature through the feed-forward network to obtain a second detection result; calculating losses based on the first image feature, the second image feature, the text feature, and the first and second detection results; and optimizing the multi-modal target detection model according to the losses.

Description

Multi-mode target detection method and device
Technical Field
The disclosure relates to the technical field of target detection, and in particular to a multi-modal target detection method and device.
Background
In recent years, detection models based on multi-modal encoders have attracted widespread attention in academia and industry owing to their excellent performance on various cross-modal tasks. Such models learn cross-modal features from image-text pairs, so they can learn rich representations and improve the accuracy of detection results. However, the relationships between modalities are usually learned from images taken from a single viewing angle, which easily leads to misjudging small or blurred objects; in complex scenes, learning from single-view images also makes it difficult to learn robust cross-modal representations.
Disclosure of Invention
In view of this, the embodiments of the present disclosure provide a multi-modal target detection method, apparatus, electronic device, and computer-readable storage medium, so as to solve the problems in the prior art that, in multi-modal detection over images and text, learning from images of a single viewing angle easily leads to misjudging small or blurred objects, and robust cross-modal representations are difficult to learn in complex scenes.
In a first aspect of the embodiments of the present disclosure, a multi-modal target detection method is provided, including: constructing a multi-modal target detection model using an image encoding network, a text encoding network, a multi-modal decoding network, and a feed-forward network; acquiring a training data set, wherein the training data set comprises multiple sets of training data, and each set of training data comprises two images of the same object from different perspectives and a text description corresponding to the two images; inputting each set of training data into the multi-modal target detection model: processing the two images and the text description through the image encoding network and the text encoding network, respectively, to obtain a first image feature, a second image feature, and a text feature; processing the first image feature and the text feature through the multi-modal decoding network to obtain a first decoding feature; processing the first decoding feature through the feed-forward network to obtain a first detection result; processing the second image feature and the text feature through the multi-modal encoding network to obtain an encoding feature; determining a second decoding feature through the multi-modal decoding network based on the first decoding feature, the text feature, and the encoding feature; processing the second decoding feature through the feed-forward network to obtain a second detection result; calculating a feature contrast loss based on the first image feature, the second image feature, and the text feature, and separately calculating detection losses corresponding to the first detection result and the second detection result; and optimizing model parameters of the multi-modal target detection model according to the feature contrast loss and the detection losses, so as to complete training of the multi-modal target detection model.
In a second aspect of the embodiments of the present disclosure, a multi-modal object detection apparatus is provided, comprising: a construction module configured to construct a multi-modal target detection model using an image encoding network, a text encoding network, a multi-modal decoding network, and a feed-forward network; an acquisition module configured to acquire a training data set, wherein the training data set comprises multiple sets of training data, and each set of training data comprises two images of the same object from different perspectives and a text description corresponding to the two images; a first processing module configured to input each set of training data into the multi-modal target detection model: processing the two images and the text description through the image encoding network and the text encoding network, respectively, to obtain a first image feature, a second image feature, and a text feature; a second processing module configured to process the first image feature and the text feature through the multi-modal decoding network to obtain a first decoding feature; a third processing module configured to process the first decoding feature through the feed-forward network to obtain a first detection result; a fourth processing module configured to process the second image feature and the text feature through the multi-modal encoding network to obtain an encoding feature; a fifth processing module configured to determine a second decoding feature through the multi-modal decoding network based on the first decoding feature, the text feature, and the encoding feature; a sixth processing module configured to process the second decoding feature through the feed-forward network to obtain a second detection result; a calculation module configured to calculate a feature contrast loss based on the first image feature, the second image feature, and the text feature, and to separately calculate detection losses corresponding to the first detection result and the second detection result; and an optimization module configured to optimize model parameters of the multi-modal target detection model according to the feature contrast loss and the detection losses, so as to complete training of the multi-modal target detection model.
In a third aspect of the disclosed embodiments, an electronic device is provided, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above method when executing the computer program.
In a fourth aspect of the disclosed embodiments, a computer-readable storage medium is provided, which stores a computer program which, when executed by a processor, implements the steps of the above-described method.
Compared with the prior art, the embodiments of the present disclosure have the following beneficial effects: because the disclosed embodiments construct a multi-modal target detection model using an image encoding network, a text encoding network, a multi-modal decoding network, and a feed-forward network; acquire a training data set, wherein the training data set comprises multiple sets of training data, and each set of training data comprises two images of the same object from different perspectives and a text description corresponding to the two images; and input each set of training data into the multi-modal target detection model: processing the two images and the text description through the image encoding network and the text encoding network, respectively, to obtain a first image feature, a second image feature, and a text feature; processing the first image feature and the text feature through the multi-modal decoding network to obtain a first decoding feature; processing the first decoding feature through the feed-forward network to obtain a first detection result; processing the second image feature and the text feature through the multi-modal encoding network to obtain an encoding feature; determining a second decoding feature through the multi-modal decoding network based on the first decoding feature, the text feature, and the encoding feature; processing the second decoding feature through the feed-forward network to obtain a second detection result; calculating a feature contrast loss based on the first image feature, the second image feature, and the text feature, and separately calculating detection losses corresponding to the first detection result and the second detection result; and optimizing model parameters of the multi-modal target detection model according to the feature contrast loss and the detection losses, so as to complete training of the multi-modal target detection model. By adopting these technical means, the problems in the prior art that, in multi-modal detection over images and text, learning from single-view images easily leads to misjudging small or blurred objects, and robust cross-modal representations are difficult to learn in complex scenes, are solved, thereby improving the detection accuracy for small or blurred objects and ensuring that robust cross-modal representations can be learned in complex scenes.
Drawings
In order to illustrate the technical solutions of the embodiments of the present disclosure more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. It is apparent that the drawings in the following description are only some embodiments of the present disclosure; for a person of ordinary skill in the art, other drawings may be obtained from these drawings without inventive effort.
Fig. 1 is a schematic flow chart of a multi-modal target detection method according to an embodiment of the disclosure;
Fig. 2 is a schematic flow chart of another multi-modal target detection method according to an embodiment of the disclosure;
Fig. 3 is a schematic structural diagram of a multi-modal object detection apparatus according to an embodiment of the disclosure;
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system configurations, techniques, etc. in order to provide a thorough understanding of the disclosed embodiments. However, it will be apparent to one skilled in the art that the present disclosure may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present disclosure with unnecessary detail.
A multi-modal object detection method and apparatus according to embodiments of the present disclosure will be described in detail with reference to the accompanying drawings.
Fig. 1 is a flow chart of a multi-modal target detection method according to an embodiment of the disclosure. The multi-modal target detection method of Fig. 1 may be performed by a computer or a server, or by software on a computer or server. As shown in Fig. 1, the multi-modal target detection method includes:
S101, constructing a multi-modal target detection model using an image encoding network, a text encoding network, a multi-modal decoding network, and a feed-forward network;
S102, acquiring a training data set, wherein the training data set comprises multiple sets of training data, and each set of training data comprises two images of the same object from different perspectives and a text description corresponding to the two images;
S103, inputting each set of training data into the multi-modal target detection model: processing the two images and the text description through the image encoding network and the text encoding network, respectively, to obtain a first image feature, a second image feature, and a text feature;
S104, processing the first image feature and the text feature through the multi-modal decoding network to obtain a first decoding feature;
S105, processing the first decoding feature through the feed-forward network to obtain a first detection result;
S106, processing the second image feature and the text feature through the multi-modal encoding network to obtain an encoding feature;
S107, determining a second decoding feature through the multi-modal decoding network based on the first decoding feature, the text feature, and the encoding feature;
S108, processing the second decoding feature through the feed-forward network to obtain a second detection result;
S109, calculating a feature contrast loss based on the first image feature, the second image feature, and the text feature, and separately calculating detection losses corresponding to the first detection result and the second detection result;
S110, optimizing model parameters of the multi-modal target detection model according to the feature contrast loss and the detection losses, so as to complete training of the multi-modal target detection model.
The image encoding network and the text encoding network may employ the towers of a multi-modal pretrained network such as CLIP (Contrastive Language-Image Pre-training), the multi-modal encoding network may employ a Transformer encoder, the multi-modal decoding network may employ a Transformer decoder, and the feed-forward network (FFN) serves as the detection head that produces the detection results. The text description corresponding to the two images is a description of the information in the two images, including information about the objects and the scene in the two images.
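For orientation, the following is a minimal PyTorch sketch of how these four networks might be wired together. All module choices, dimensions, and layer counts are illustrative assumptions: the patent names the component networks but does not fix a concrete architecture.

```python
# Minimal sketch of the model layout. Module choices, dimensions, and
# layer counts are illustrative assumptions, not the patent's design.
import torch.nn as nn

class MultiModalDetector(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_classes=80, vocab_size=30522):
        super().__init__()
        # Stand-ins for the CLIP image and text encoding towers.
        self.image_encoder = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        self.text_encoder = nn.Embedding(vocab_size, d_model)
        # The multi-modal encoding and decoding networks both consume
        # separate Q and K/V inputs (cross-attention), so a Transformer
        # decoder layer stack is used as a stand-in for each.
        self.mm_encoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True),
            num_layers=6)
        self.mm_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True),
            num_layers=6)
        # Feed-forward detection head: class logits and a (cx, cy, w, h) box.
        self.ffn_cls = nn.Linear(d_model, n_classes)
        self.ffn_box = nn.Linear(d_model, 4)
```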
According to the technical solution provided by the embodiments of the present application, a multi-modal target detection model is constructed using an image encoding network, a text encoding network, a multi-modal decoding network, and a feed-forward network; a training data set is acquired, wherein the training data set comprises multiple sets of training data, and each set of training data comprises two images of the same object from different perspectives and a text description corresponding to the two images; each set of training data is input into the multi-modal target detection model: the two images and the text description are processed through the image encoding network and the text encoding network, respectively, to obtain a first image feature, a second image feature, and a text feature; the first image feature and the text feature are processed through the multi-modal decoding network to obtain a first decoding feature; the first decoding feature is processed through the feed-forward network to obtain a first detection result; the second image feature and the text feature are processed through the multi-modal encoding network to obtain an encoding feature; a second decoding feature is determined through the multi-modal decoding network based on the first decoding feature, the text feature, and the encoding feature; the second decoding feature is processed through the feed-forward network to obtain a second detection result; a feature contrast loss is calculated based on the first image feature, the second image feature, and the text feature, and detection losses corresponding to the first detection result and the second detection result are calculated separately; and model parameters of the multi-modal target detection model are optimized according to the feature contrast loss and the detection losses, so as to complete training of the multi-modal target detection model. By adopting these technical means, the problems in the prior art that, in multi-modal detection over images and text, learning from single-view images easily leads to misjudging small or blurred objects, and robust cross-modal representations are difficult to learn in complex scenes, are solved, thereby improving the detection accuracy for small or blurred objects and ensuring that robust cross-modal representations can be learned in complex scenes.
Further, processing the first image feature and the text feature through the multi-modal decoding network to obtain the first decoding feature includes: taking the text feature as a first query vector, and taking the first image feature as both a first key vector and a first value vector; and inputting the first query vector, the first key vector, and the first value vector into the multi-modal decoding network, which outputs the first decoding feature.
It should be noted that the query, key, and value vectors in the present application are, respectively, the Q (Query), K (Key), and V (Value) vectors of the attention mechanism.
The first key vector and the first value vector are identical: both are the first image feature, while the first query vector is the text feature. The first query vector, the first key vector, and the first value vector are input into the multi-modal decoding network as a vector group (Q, K, V), and the network outputs the first decoding feature.
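The following is a hedged sketch of this (Q, K, V) step, using a single nn.MultiheadAttention layer as a stand-in for the full multi-modal decoding network; the batch size, token counts, and feature dimension are assumptions. The same pattern applies to the encoding step described next, with the text feature as Q and the second image feature as K and V.

```python
# Q = text feature, K = V = first image feature, fed through a single
# cross-attention layer as a stand-in for the multi-modal decoding network.
import torch
import torch.nn as nn

d_model, n_heads = 256, 8
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

text_feat = torch.randn(2, 16, d_model)   # (batch, text tokens, dim)
img1_feat = torch.randn(2, 196, d_model)  # (batch, image patches, dim)

first_decoding_feat, _ = cross_attn(query=text_feat, key=img1_feat, value=img1_feat)
print(first_decoding_feat.shape)  # torch.Size([2, 16, 256]) — one row per text token
```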
Further, processing the second image feature and the text feature through the multi-modal encoding network to obtain the encoding feature includes: taking the text feature as a second query vector, and taking the second image feature as both a second key vector and a second value vector; and inputting the second query vector, the second key vector, and the second value vector into the multi-modal encoding network, which outputs the encoding feature.
The second key vector and the second value vector are identical: both are the second image feature, while the second query vector is the text feature. The three are input into the multi-modal encoding network as a vector group (Q, K, V), and the network outputs the encoding feature.
Further, determining the second decoding feature through the multi-modal decoding network based on the first decoding feature, the text feature, and the encoding feature includes: performing a dimension transformation on the first decoding feature, wherein the dimension of the transformed first decoding feature is the same as that of the text feature; performing a matrix summation of the dimension-transformed first decoding feature and the text feature to obtain a summation result; taking the summation result as a third query vector, and taking the encoding feature as both a third key vector and a third value vector; and inputting the third query vector, the third key vector, and the third value vector into the multi-modal decoding network, which outputs the second decoding feature.
That is, the first decoding feature is first transformed to the same dimension as the text feature and then added to the text feature to obtain the summation result. The third key vector and the third value vector are identical: both are the encoding feature, while the third query vector is the summation result. The three are input into the multi-modal decoding network as a vector group (Q, K, V), and the network outputs the second decoding feature.
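A sketch of this second decoding step follows. A linear layer stands in for the dimension transformation, which is an assumption: the patent requires only that the transformed feature match the text feature's dimension.

```python
# Project the first decoding feature to the text feature's dimension,
# add the two (matrix summation), then cross-attend into the encoding
# feature: Q = summed result, K = V = encoding feature.
import torch
import torch.nn as nn

d_model = 256
dim_transform = nn.Linear(d_model, d_model)  # assumed realization of the transform
cross_attn = nn.MultiheadAttention(d_model, 8, batch_first=True)

first_dec = torch.randn(2, 16, d_model)  # first decoding feature
text_feat = torch.randn(2, 16, d_model)  # text feature
enc_feat = torch.randn(2, 196, d_model)  # encoding feature (view 2 + text)

summed = dim_transform(first_dec) + text_feat  # summation result (third query)
second_dec, _ = cross_attn(query=summed, key=enc_feat, value=enc_feat)
```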
Further, calculating the feature contrast loss based on the first image feature, the second image feature, and the text feature includes: calculating a first contrast loss between the first image feature and the second image feature using a contrastive learning loss function; calculating a second contrast loss between the first image feature and the text feature using the contrastive learning loss function; and calculating a third contrast loss between the second image feature and the text feature using the contrastive learning loss function. The feature contrast loss comprises the first, second, and third contrast losses.
The contrastive learning loss function may be the InfoNCE loss. The first, second, and third contrast losses are weighted and summed to obtain the feature contrast loss.
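A sketch of one pairwise term follows, using a standard one-directional InfoNCE formulation in which matching rows of a batch are positives and all other rows are negatives. The pooling of each feature to one vector per sample, the temperature, and the unit weights in the final sum are assumptions.

```python
# One-directional InfoNCE: matching rows of a and b are positives,
# all other rows in the batch are negatives.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    # a, b: (batch, dim) pooled features.
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature               # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)

f1, f2, ft = (torch.randn(8, 256) for _ in range(3))  # pooled features
feature_contrast_loss = info_nce(f1, f2) + info_nce(f1, ft) + info_nce(f2, ft)
```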
Further, calculating the detection loss corresponding to the first detection result includes: the first detection result comprises a first prediction box and a first prediction category, the labels corresponding to the two images comprise a labeling box and a labeling category, and the detection loss corresponding to the first detection result comprises a first detection loss, a second detection loss, and a third detection loss; calculating the first detection loss between the first prediction category and the labeling category using a cross-entropy loss function; calculating the second detection loss between the first prediction box and the labeling box using an L1-norm loss function; and calculating the third detection loss between the first prediction box and the labeling box using a cross-ratio loss function.
The prediction box represents the predicted position of the object in the image, the labeling box represents the actual position of the object in the image, the prediction category represents the predicted category of the object, and the labeling category represents the actual category of the object. For example, if the object is a person, the labeling category is 'person'; if the object is a dog, the labeling category is 'dog'; and if the object is a car, the labeling category is 'car'.
The cross-ratio loss function is based on the intersection over union (IoU). The first, second, and third detection losses are weighted and summed to obtain the detection loss corresponding to the first detection result.
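The following sketch combines the three terms for one detection result (it applies identically to the second detection result described next). The box format (x1, y1, x2, y2), the loss weights, and the assumption that predictions are already matched one-to-one with labeling boxes (DETR-style detectors usually obtain such a matching via Hungarian matching, which the patent does not describe) are all illustrative.

```python
# Three-term detection loss: cross-entropy on classes, L1 on boxes,
# and an IoU-based term, combined by a weighted sum.
import torch
import torch.nn.functional as F
from torchvision.ops import box_iou

def detection_loss(pred_logits, pred_boxes, gt_classes, gt_boxes,
                   w_cls=1.0, w_l1=5.0, w_iou=2.0):
    # pred_logits: (N, n_classes); pred_boxes/gt_boxes: (N, 4) as x1y1x2y2.
    cls_loss = F.cross_entropy(pred_logits, gt_classes)  # first detection loss
    l1_loss = F.l1_loss(pred_boxes, gt_boxes)            # second detection loss
    iou = box_iou(pred_boxes, gt_boxes).diag()           # IoU of matched pairs
    iou_loss = (1.0 - iou).mean()                        # third detection loss
    return w_cls * cls_loss + w_l1 * l1_loss + w_iou * iou_loss
```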
Further, calculating the detection loss corresponding to the second detection result includes: the second detection result comprises a second prediction box and a second prediction category, the labels corresponding to the two images comprise a labeling box and a labeling category, and the detection loss corresponding to the second detection result comprises a fourth detection loss, a fifth detection loss, and a sixth detection loss; calculating the fourth detection loss between the second prediction category and the labeling category using the cross-entropy loss function; calculating the fifth detection loss between the second prediction box and the labeling box using the L1-norm loss function; and calculating the sixth detection loss between the second prediction box and the labeling box using the cross-ratio loss function.
The fourth, fifth, and sixth detection losses are weighted and summed to obtain the detection loss corresponding to the second detection result.
Fig. 2 is a flow chart of another multi-modal target detection method according to an embodiment of the disclosure. As shown in Fig. 2, the method includes:
S201, acquiring two target images of a target object from different perspectives and a target text description corresponding to the two target images, and inputting the two target images and the corresponding target text description into the trained multi-modal target detection model:
S202, processing the two target images and the target text description through the image encoding network and the text encoding network, respectively, to obtain a first target image feature, a second target image feature, and a target text feature;
S203, processing the first target image feature and the target text feature through the multi-modal decoding network to obtain a first target decoding feature;
S204, processing the first target decoding feature through the feed-forward network to obtain a first target detection result;
S205, processing the second target image feature and the target text feature through the multi-modal encoding network to obtain a target encoding feature;
S206, determining a second target decoding feature through the multi-modal decoding network based on the first target decoding feature, the target text feature, and the target encoding feature;
S207, processing the second target decoding feature through the feed-forward network to obtain a second target detection result;
S208, performing a weighted summation of the first target detection result and the second target detection result to obtain a final detection result.
Alternatively, the second target detection result may be used directly as the final detection result. The first target detection result is the detection result for the target object in the image corresponding to the first target image feature, and the second target detection result is the detection result for the target object in the image corresponding to the second target image feature. The two target images are processed through the image encoding network to obtain the first target image feature and the second target image feature.
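A minimal sketch of the fusion step in S208 follows; the equal weights are an assumption, and, as noted, the second target detection result may instead be used alone.

```python
# Weighted fusion of the two detection results; alpha = 0.5 is an assumption.
def fuse_results(result1, result2, alpha=0.5):
    # Each result: dict of tensors (e.g. class logits and boxes) of equal shapes.
    return {k: alpha * result1[k] + (1 - alpha) * result2[k] for k in result1}
```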
Any combination of the above optional solutions may be adopted to form an optional embodiment of the present application, which is not described herein in detail.
The following are device embodiments of the present disclosure that may be used to perform method embodiments of the present disclosure. For details not disclosed in the embodiments of the apparatus of the present disclosure, please refer to the embodiments of the method of the present disclosure.
Fig. 3 is a schematic diagram of a multi-modal object detection apparatus according to an embodiment of the disclosure. As shown in Fig. 3, the multi-modal object detection apparatus includes:
a construction module 301 configured to construct a multi-modal target detection model using an image encoding network, a text encoding network, a multi-modal decoding network, and a feed-forward network;
an acquisition module 302 configured to acquire a training data set, wherein the training data set includes multiple sets of training data, each set of training data including two images of the same object from different perspectives and a text description corresponding to the two images;
a first processing module 303 configured to input each set of training data into the multi-modal target detection model: processing the two images and the text description through the image encoding network and the text encoding network, respectively, to obtain a first image feature, a second image feature, and a text feature;
a second processing module 304 configured to process the first image feature and the text feature through the multi-modal decoding network to obtain a first decoding feature;
a third processing module 305 configured to process the first decoding feature through the feed-forward network to obtain a first detection result;
a fourth processing module 306 configured to process the second image feature and the text feature through the multi-modal encoding network to obtain an encoding feature;
a fifth processing module 307 configured to determine a second decoding feature through the multi-modal decoding network based on the first decoding feature, the text feature, and the encoding feature;
a sixth processing module 308 configured to process the second decoding feature through the feed-forward network to obtain a second detection result;
a calculation module 309 configured to calculate a feature contrast loss based on the first image feature, the second image feature, and the text feature, and to separately calculate detection losses corresponding to the first detection result and the second detection result;
an optimization module 310 configured to optimize model parameters of the multi-modal target detection model according to the feature contrast loss and the detection losses, so as to complete training of the multi-modal target detection model.
According to the technical solution provided by the embodiments of the present application, a multi-modal target detection model is constructed using an image encoding network, a text encoding network, a multi-modal decoding network, and a feed-forward network; a training data set is acquired, wherein the training data set comprises multiple sets of training data, and each set of training data comprises two images of the same object from different perspectives and a text description corresponding to the two images; each set of training data is input into the multi-modal target detection model: the two images and the text description are processed through the image encoding network and the text encoding network, respectively, to obtain a first image feature, a second image feature, and a text feature; the first image feature and the text feature are processed through the multi-modal decoding network to obtain a first decoding feature; the first decoding feature is processed through the feed-forward network to obtain a first detection result; the second image feature and the text feature are processed through the multi-modal encoding network to obtain an encoding feature; a second decoding feature is determined through the multi-modal decoding network based on the first decoding feature, the text feature, and the encoding feature; the second decoding feature is processed through the feed-forward network to obtain a second detection result; a feature contrast loss is calculated based on the first image feature, the second image feature, and the text feature, and detection losses corresponding to the first detection result and the second detection result are calculated separately; and model parameters of the multi-modal target detection model are optimized according to the feature contrast loss and the detection losses, so as to complete training of the multi-modal target detection model. By adopting these technical means, the problems in the prior art that, in multi-modal detection over images and text, learning from single-view images easily leads to misjudging small or blurred objects, and robust cross-modal representations are difficult to learn in complex scenes, are solved, thereby improving the detection accuracy for small or blurred objects and ensuring that robust cross-modal representations can be learned in complex scenes.
In some embodiments, the second processing module 304 is further configured to take the text feature as a first query vector and the first image feature as both a first key vector and a first value vector; and to input the first query vector, the first key vector, and the first value vector into the multi-modal decoding network, outputting the first decoding feature.
In some embodiments, the fourth processing module 306 is further configured to take the text feature as a second query vector and the second image feature as both a second key vector and a second value vector; and to input the second query vector, the second key vector, and the second value vector into the multi-modal encoding network, outputting the encoding feature.
In some embodiments, the fifth processing module 307 is further configured to perform a dimension transformation on the first decoding feature, wherein the transformed first decoding feature has the same dimension as the text feature; perform a matrix summation of the dimension-transformed first decoding feature and the text feature to obtain a summation result; take the summation result as a third query vector and the encoding feature as both a third key vector and a third value vector; and input the third query vector, the third key vector, and the third value vector into the multi-modal decoding network, outputting the second decoding feature.
In some embodiments, the calculation module 309 is further configured to calculate a first contrast loss between the first image feature and the second image feature using a contrastive learning loss function; calculate a second contrast loss between the first image feature and the text feature using the contrastive learning loss function; and calculate a third contrast loss between the second image feature and the text feature using the contrastive learning loss function. The feature contrast loss comprises the first, second, and third contrast losses.
In some embodiments, the calculation module 309 is further configured such that the first detection result comprises a first prediction box and a first prediction category, the labels corresponding to the two images comprise a labeling box and a labeling category, and the detection loss corresponding to the first detection result comprises a first detection loss, a second detection loss, and a third detection loss; the module calculates the first detection loss between the first prediction category and the labeling category using a cross-entropy loss function, the second detection loss between the first prediction box and the labeling box using an L1-norm loss function, and the third detection loss between the first prediction box and the labeling box using a cross-ratio loss function.
In some embodiments, the calculation module 309 is further configured such that the second detection result comprises a second prediction box and a second prediction category, the labels corresponding to the two images comprise a labeling box and a labeling category, and the detection loss corresponding to the second detection result comprises a fourth detection loss, a fifth detection loss, and a sixth detection loss; the module calculates the fourth detection loss between the second prediction category and the labeling category using the cross-entropy loss function, the fifth detection loss between the second prediction box and the labeling box using the L1-norm loss function, and the sixth detection loss between the second prediction box and the labeling box using the cross-ratio loss function.
In some embodiments, the optimization module 310 is further configured to acquire two target images of a target object from different perspectives and a target text description corresponding to the two target images, input the two target images and the corresponding target text description into the trained multi-modal target detection model, and process the two target images and the target text description through the image encoding network and the text encoding network, respectively, to obtain a first target image feature, a second target image feature, and a target text feature; process the first target image feature and the target text feature through the multi-modal decoding network to obtain a first target decoding feature; process the first target decoding feature through the feed-forward network to obtain a first target detection result; process the second target image feature and the target text feature through the multi-modal encoding network to obtain a target encoding feature; determine a second target decoding feature through the multi-modal decoding network based on the first target decoding feature, the target text feature, and the target encoding feature; process the second target decoding feature through the feed-forward network to obtain a second target detection result; and perform a weighted summation of the first target detection result and the second target detection result to obtain a final detection result.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and shall not constitute any limitation on the implementation process of the embodiments of the disclosure.
Fig. 4 is a schematic diagram of an electronic device 4 provided by an embodiment of the present disclosure. As shown in fig. 4, the electronic apparatus 4 of this embodiment includes: a processor 401, a memory 402 and a computer program 403 stored in the memory 402 and executable on the processor 401. The steps of the various method embodiments described above are implemented by processor 401 when executing computer program 403. Alternatively, the processor 401, when executing the computer program 403, performs the functions of the modules/units in the above-described apparatus embodiments.
The electronic device 4 may be a desktop computer, a notebook computer, a palm computer, a cloud server, or the like. The electronic device 4 may include, but is not limited to, a processor 401 and a memory 402. It will be appreciated by those skilled in the art that fig. 4 is merely an example of the electronic device 4 and is not limiting of the electronic device 4 and may include more or fewer components than shown, or different components.
The processor 401 may be a central processing unit (Central Processing Unit, CPU) or other general purpose processor, digital signal processor (Digital Signal Processor, DSP), application specific integrated circuit (Application Specific Integrated Circuit, ASIC), field programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like.
The memory 402 may be an internal storage unit of the electronic device 4, for example, a hard disk or a memory of the electronic device 4. The memory 402 may also be an external storage device of the electronic device 4, for example, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, or the like provided on the electronic device 4. The memory 402 may also include both an internal storage unit and an external storage device of the electronic device 4. The memory 402 is used to store computer programs and other programs and data required by the electronic device.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the present disclosure may implement all or part of the flow of the methods of the above embodiments by instructing related hardware through a computer program; the computer program may be stored in a computer-readable storage medium and, when executed by a processor, may implement the steps of the method embodiments described above. The computer program may comprise computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and so forth. It should be noted that the content of the computer-readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, the computer-readable medium does not include electrical carrier signals and telecommunication signals.
The above embodiments are only intended to illustrate the technical solutions of the present disclosure, not to limit them. Although the present disclosure has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the disclosure, and are intended to be included within the scope of the present disclosure.

Claims (10)

1. A multi-modal object detection method, comprising:
constructing a multi-modal target detection model using an image encoding network, a text encoding network, a multi-modal decoding network, and a feed-forward network;
acquiring a training data set, wherein the training data set comprises a plurality of sets of training data, and each set of training data comprises two images of the same object from different perspectives and a text description corresponding to the two images;
inputting each set of training data into the multi-modal target detection model:
processing the two images and the text description through the image encoding network and the text encoding network, respectively, to obtain a first image feature, a second image feature, and a text feature;
processing the first image feature and the text feature through the multi-modal decoding network to obtain a first decoding feature;
processing the first decoding feature through the feed-forward network to obtain a first detection result;
processing the second image feature and the text feature through the multi-modal encoding network to obtain an encoding feature;
determining, by the multi-modal decoding network, a second decoding feature based on the first decoding feature, the text feature, and the encoding feature;
processing the second decoding feature through the feed-forward network to obtain a second detection result;
calculating a feature contrast loss based on the first image feature, the second image feature, and the text feature, and separately calculating detection losses corresponding to the first detection result and the second detection result; and
optimizing model parameters of the multi-modal target detection model according to the feature contrast loss and the detection losses, so as to complete training of the multi-modal target detection model.
2. The method of claim 1, wherein processing the first image feature and the text feature through the multi-modal decoding network to obtain the first decoding feature comprises:
taking the text feature as a first query vector, and taking the first image feature as a first key vector and a first value vector; and
inputting the first query vector, the first key vector, and the first value vector into the multi-modal decoding network, and outputting the first decoding feature.
3. The method of claim 1, wherein processing the second image feature and the text feature through the multi-modal encoding network to obtain the encoding feature comprises:
taking the text feature as a second query vector, and taking the second image feature as a second key vector and a second value vector; and
inputting the second query vector, the second key vector, and the second value vector into the multi-modal encoding network, and outputting the encoding feature.
4. The method of claim 1, wherein determining, by the multi-modal decoding network, the second decoding feature based on the first decoding feature, the text feature, and the encoding feature comprises:
performing a dimension transformation on the first decoding feature, wherein the dimension of the transformed first decoding feature is the same as that of the text feature;
performing a matrix summation of the dimension-transformed first decoding feature and the text feature to obtain a summation result;
taking the summation result as a third query vector, and taking the encoding feature as a third key vector and a third value vector; and
inputting the third query vector, the third key vector, and the third value vector into the multi-modal decoding network, and outputting the second decoding feature.
5. The method of claim 1, wherein calculating the feature contrast loss based on the first image feature, the second image feature, and the text feature comprises:
calculating a first contrast loss between the first image feature and the second image feature using a contrastive learning loss function;
calculating a second contrast loss between the first image feature and the text feature using the contrastive learning loss function; and
calculating a third contrast loss between the second image feature and the text feature using the contrastive learning loss function;
wherein the feature contrast loss comprises: the first contrast loss, the second contrast loss, and the third contrast loss.
6. The method of claim 1, wherein calculating the detection loss corresponding to the first detection result comprises:
the first detection result comprises a first prediction box and a first prediction category, the labels corresponding to the two images comprise a labeling box and a labeling category, and the detection loss corresponding to the first detection result comprises a first detection loss, a second detection loss, and a third detection loss;
calculating the first detection loss between the first prediction category and the labeling category using a cross-entropy loss function;
calculating the second detection loss between the first prediction box and the labeling box using an L1-norm loss function; and
calculating the third detection loss between the first prediction box and the labeling box using a cross-ratio loss function.
7. The method of claim 1, wherein calculating the detection loss corresponding to the second detection result comprises:
the second detection result comprises a second prediction box and a second prediction category, the labels corresponding to the two images comprise a labeling box and a labeling category, and the detection loss corresponding to the second detection result comprises a fourth detection loss, a fifth detection loss, and a sixth detection loss;
calculating the fourth detection loss between the second prediction category and the labeling category using the cross-entropy loss function;
calculating the fifth detection loss between the second prediction box and the labeling box using the L1-norm loss function; and
calculating the sixth detection loss between the second prediction box and the labeling box using the cross-ratio loss function.
8. A multi-modal object detection apparatus, comprising:
a construction module configured to construct a multi-modal target detection model using an image encoding network, a text encoding network, a multi-modal decoding network, and a feed-forward network;
an acquisition module configured to acquire a training data set, wherein the training data set comprises a plurality of sets of training data, and each set of training data comprises two images of the same object from different perspectives and a text description corresponding to the two images;
a first processing module configured to input each set of training data into the multi-modal target detection model: processing the two images and the text description through the image encoding network and the text encoding network, respectively, to obtain a first image feature, a second image feature, and a text feature;
a second processing module configured to process the first image feature and the text feature through the multi-modal decoding network to obtain a first decoding feature;
a third processing module configured to process the first decoding feature through the feed-forward network to obtain a first detection result;
a fourth processing module configured to process the second image feature and the text feature through the multi-modal encoding network to obtain an encoding feature;
a fifth processing module configured to determine a second decoding feature through the multi-modal decoding network based on the first decoding feature, the text feature, and the encoding feature;
a sixth processing module configured to process the second decoding feature through the feed-forward network to obtain a second detection result;
a calculation module configured to calculate a feature contrast loss based on the first image feature, the second image feature, and the text feature, and to separately calculate detection losses corresponding to the first detection result and the second detection result; and
an optimization module configured to optimize model parameters of the multi-modal target detection model according to the feature contrast loss and the detection losses, so as to complete training of the multi-modal target detection model.
9. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 7 when the computer program is executed.
10. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any one of claims 1 to 7.
CN202311727983.4A 2023-12-14 2023-12-14 Multi-mode target detection method and device Pending CN117892123A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311727983.4A CN117892123A (en) 2023-12-14 2023-12-14 Multi-mode target detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311727983.4A CN117892123A (en) 2023-12-14 2023-12-14 Multi-mode target detection method and device

Publications (1)

Publication Number Publication Date
CN117892123A 2024-04-16

Family

ID=90651580

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311727983.4A Pending CN117892123A (en) 2023-12-14 2023-12-14 Multi-mode target detection method and device

Country Status (1)

Country Link
CN (1) CN117892123A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination