CN116958512A - Target detection method, target detection device, computer readable medium and electronic equipment - Google Patents

Target detection method, target detection device, computer readable medium and electronic equipment

Info

Publication number
CN116958512A
CN116958512A (application CN202310616089.3A)
Authority
CN
China
Prior art keywords
text
loss
picture
detection
target detection
Prior art date
Legal status
Pending
Application number
CN202310616089.3A
Other languages
Chinese (zh)
Inventor
张梦丹
陈珮娴
傅朝友
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202310616089.3A priority Critical patent/CN116958512A/en
Publication of CN116958512A publication Critical patent/CN116958512A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; locating or processing of specific regions to guide the detection or recognition
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/766 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Abstract

The embodiment of the application provides a target detection method, a target detection device, a computer readable medium and electronic equipment. The method comprises the following steps: acquiring a picture and a text description corresponding to the picture, and masking the text description at least once to obtain a masked text description for each masking, where the masked objects are nouns; extracting regional feature vectors of the picture and a group of word embedding vectors corresponding to each masked text description; performing a multi-head self-attention fusion operation according to at least a part of the regional feature vectors and each group of word embedding vectors to predict the features of the masked nouns; determining a text mask loss according to the predicted features and the masked nouns, where the text mask loss is used to measure the difference between the predicted features and the masked nouns, and training a target detection model according to the text mask loss; and performing target detection on a picture to be detected based on the target detection model. The embodiment of the application can effectively improve the detection precision of the model on new categories.

Description

Target detection method, target detection device, computer readable medium and electronic equipment
Technical Field
The present application relates to the field of target detection technologies, and in particular, to a target detection method, a target detection device, a computer readable medium, and an electronic device.
Background
Object detection is one of the most fundamental tasks in computer vision, aimed at locating and identifying objects of different sizes and categories in an image.
Currently, supervised methods are commonly adopted for target detection; however, they require expensive labeling costs and cannot effectively detect targets of new categories.
More recently, methods have emerged that train detection models with the open-vocabulary target detection task; however, the accuracy of such methods remains low.
Disclosure of Invention
The embodiment of the application provides a target detection method, a target detection device, a computer-readable medium and electronic equipment, which can, at least to a certain extent, effectively improve the detection precision of a model on new categories while saving labeling cost.
Other features and advantages of the application will be apparent from the following detailed description, or may be learned by the practice of the application.
According to an aspect of an embodiment of the present application, there is provided a target detection method, the method including: acquiring a picture and a text description corresponding to the picture, masking the text description at least once to obtain a masked text description corresponding to each masking, wherein mask objects of the text description are masked nouns; extracting regional feature vectors corresponding to a plurality of regions of the picture respectively, and extracting a group of word embedding vectors corresponding to each of the masked text descriptions respectively; performing multi-head self-attention fusion operation according to at least one part of each regional feature vector and each group of word embedding vectors so as to predict and obtain predicted features of mask nouns; determining text mask loss according to the predicted features and the masked nouns, and training according to the text mask loss to obtain a target detection model, wherein the text mask loss is used for measuring the difference between the predicted features and the masked nouns; and carrying out target detection on the picture to be detected based on the target detection model.
According to an aspect of an embodiment of the present application, there is provided an object detection apparatus including: an acquisition unit, configured to acquire a picture and a text description corresponding to the picture, and to mask the text description at least once to obtain a masked text description corresponding to each masking, where the masked objects of the text description are masked nouns; an extraction unit, configured to extract regional feature vectors respectively corresponding to a plurality of regions of the picture, and to extract a group of word embedding vectors respectively corresponding to each of the masked text descriptions; a self-attention fusion unit, configured to perform a multi-head self-attention fusion operation according to at least a part of the regional feature vectors and each group of word embedding vectors so as to predict the predicted features of the masked nouns; a model training unit, configured to determine a text mask loss according to the predicted features and the masked nouns and to train a target detection model according to the text mask loss, where the text mask loss is used to measure the difference between the predicted features and the masked nouns; and a target detection unit, configured to perform target detection on a picture to be detected based on the target detection model.
In some embodiments of the application, based on the foregoing scheme, the self-attention fusion unit is configured to: performing multi-head self-attention fusion operation according to at least one part of each regional characteristic vector and each group of word embedding vectors to predict and obtain predicted characteristics of mask nouns, and determining attention activation values between the predicted characteristics of the mask nouns and the regional characteristic vectors; the model training unit is configured to: and determining target diversity loss according to the attention activation value, and training according to the target diversity loss and the text mask loss to obtain a target detection model, wherein the target diversity loss is used for enhancing the attention of the word embedding vector to a target area matched with the word embedding vector and weakening the attention of other areas except the target area in the plurality of areas.
In some embodiments of the present application, based on the foregoing, the apparatus further includes a detection data set acquisition unit, a teletext data set acquisition unit, and a training unit; before masking the text description at least once, the detection dataset acquisition unit is configured to: acquiring a detection data set comprising a plurality of detection samples, wherein the detection samples comprise pictures and labeling information corresponding to targets in the pictures, and the labeling information comprises real labeling frames and categories; the image-text data set acquisition unit is used for: acquiring an image-text data set comprising a plurality of image-text description pairs, wherein the image-text description pairs comprise images and text descriptions corresponding to the images; the training unit is used for: and training an original target detection model based on the detection data set and the image-text data set, wherein the target detection model is obtained by training the trained original target detection model.
In some embodiments of the application, based on the foregoing, the training unit is configured to: inputting the picture-text description in the picture-text data set into the original target detection model in batches to extract a plurality of regional feature vectors of pictures in the picture-text description pair of each batch and a plurality of word embedding vectors of the text description in the picture-text description pair of each batch; determining graph description contrast loss according to each region feature vector and each word embedding vector corresponding to the graph-text description pairs in the same batch, wherein the graph description contrast loss is used for learning the mapping relation between the graph and the text description; respectively inputting each picture in the detection data set into the original target detection model, determining region recommendation loss according to a predicted anchor frame and a real annotation frame for generating a region, determining detection classification loss according to an output classification result and a corresponding category, and determining detection frame regression loss according to an output detection frame prediction result and a corresponding real annotation frame, wherein the region recommendation loss is used for measuring the accuracy of the predicted anchor frame, the detection classification loss is used for measuring the accuracy of the output classification result, and the detection frame regression loss is used for measuring the accuracy of the output detection frame prediction result; and training an original target detection model according to the graphic description comparison loss, the region recommendation loss, the detection classification loss and the detection frame regression loss.
In some embodiments of the application, based on the foregoing, the object detection model outputs the class of the object through a classification head, the classification head being a fixed first text-embedding matrix, the first text-embedding matrix being pre-trained.
In some embodiments of the present application, based on the foregoing solution, the object detection of the picture to be detected based on the object detection model is performed when the first text embedding matrix matches a category to be detected.
In some embodiments of the present application, based on the foregoing scheme, the category to be detected is different from the category in the annotation information of the detection dataset.
In some embodiments of the application, based on the foregoing scheme, the self-attention fusion unit is configured to: according to the similarity between the word embedding vector of the mask noun and each regional feature vector, a preset number of regional feature vectors are selected from the regional feature vectors; and performing multi-head self-attention fusion operation according to the screened regional feature vectors and each group of word embedding vectors.
In some embodiments of the present application, based on the foregoing solution, the graphic description contrast loss includes a matching loss of the picture and each text description in the same batch and a matching loss of the text description and each picture in the same batch, where the graphic description contrast loss is calculated according to an overall similarity between the picture and the text description.
According to an aspect of the embodiments of the present application, there is provided a computer readable medium having stored thereon a computer program which, when executed by a processor, implements the object detection method as described in the above embodiments.
According to an aspect of an embodiment of the present application, there is provided an electronic apparatus including: one or more processors; and a storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the target detection method as described in the above embodiments.
According to an aspect of an embodiment of the present application, there is provided a computer program product including computer instructions stored in a computer-readable storage medium, from which computer instructions a processor of a computer device reads, the processor executing the computer instructions, causing the computer device to perform the object detection method as described in the above embodiment.
In the technical schemes provided by some embodiments of the present application, after obtaining a picture and a text description corresponding to the picture, masking the text description one or more times, and obtaining a corresponding masked text description after masking each time; then, extracting regional feature vectors according to the pictures respectively, and extracting word embedding vectors according to the text description after masking; and then, performing multi-head self-attention fusion operation on at least one part of the feature vectors of each region and each group of word embedding vectors, so as to predict and obtain the predicted features of the mask nouns, further determining text mask loss according to the predicted features and the mask nouns, and finally training a target detection model capable of carrying out target detection on the picture to be detected by utilizing the text mask loss. Because the embodiment of the application uses the pictures and the text descriptions corresponding to the pictures to train the model, a large-scale target detection data set is not needed to train, and the model obtained by training can effectively detect new types of targets, thereby greatly saving the labeling cost; more importantly, since the text mask loss is determined according to the predicted features and the mask nouns, the predicted features are predicted by carrying out multi-head self-attention fusion operation on at least one part of the feature vectors of each region and each group of word embedding vectors, and interaction of image-text context information can be realized through the multi-head self-attention fusion operation, so that influence of sample noise can be reduced, the overall performance and generalization of the model are obviously improved, and the detection precision of the model on new categories is effectively improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and, together with the description, serve to explain the principles of the application. It is evident that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained from these drawings by a person of ordinary skill in the art without inventive effort. In the drawings:
FIG. 1 shows a schematic diagram of a related art model training architecture in comparison with the present application;
FIG. 2 shows a schematic diagram of an exemplary system architecture to which the technical solution of an embodiment of the application may be applied;
FIG. 3 shows a flow chart of a method of object detection according to one embodiment of the application;
FIG. 4 shows a schematic diagram of a model training architecture according to one embodiment of the application;
FIG. 5 shows a flowchart of steps followed by masking a text description at least once, according to one embodiment of the application;
FIG. 6 illustrates a flow chart for training an original object detection model based on a detection dataset and a teletext dataset, according to an embodiment of the application;
FIG. 7 illustrates an architecture diagram for training with the introduction of a contextual feature fusion module according to one embodiment of the present application;
FIG. 8 shows a flowchart of details of step 330 in the embodiment of FIG. 3, according to one embodiment of the application;
FIG. 9 shows a flowchart of details of steps 330 and 340 in the embodiment of FIG. 3, according to one embodiment of the application;
FIG. 10 shows a block diagram of an object detection device according to one embodiment of the application;
fig. 11 shows a schematic diagram of a computer system suitable for use in implementing an embodiment of the application.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the application may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the application.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
Object detection is one of the most fundamental tasks in computer vision, aimed at locating and identifying objects of different sizes and categories in an image. With the surge of massive image data on network platforms, the dense deployment of urban cameras and the wide application of low-altitude unmanned aerial vehicle photography, image detection technology is widely applied in scenes such as social entertainment, security and low-altitude observation, and has great practical significance. However, supervised target detection methods require labeling locations and categories for a large number of samples, which incurs expensive labeling costs, and the detectable categories are limited by the labeled detection datasets, which is insufficient for detecting new categories.
The open-vocabulary target detection task (Open-vocabulary Detection, OVD) uses a large-scale image-text dataset containing only image-level text description labels to assist in training the detection model, increasing the number of categories the model supports and thereby greatly saving manual labeling cost. Existing OVD methods generally learn a point-to-point image-text correspondence between a certain word in the text description and a certain detection box in the image, and lack a reasonable understanding of the image-text context, which limits the accuracy of OVD.
Among open-set detection methods in the related art, some borrow training strategies from image-text pre-training models and learn instance-level image features on image-text data so as to align them with instance category vocabulary features. Others use knowledge distillation to transfer the better-generalizing features of an image-text pre-training model into the detection model, thereby enhancing the detection model's ability to recognize new categories.
However, open-set detection methods in the related art usually learn a point-to-point image-text correspondence between a certain word in the text description and a certain detection box in the image. Due to the lack of ground-truth annotations linking image region boxes to class words, existing methods obtain sample pairs through a pre-trained image-text model and a pre-trained detection model, and these sample pairs often contain considerable noise that impairs the learning of the detection model; meanwhile, a detection model trained on such noisy samples lacks a reasonable understanding of the image-text context, which limits the accuracy of open-set detection.
FIG. 1 shows a schematic diagram of a related-art model training architecture in comparison with the present application. Referring to fig. 1, the conventional open-set detection method uses a conventional OVD detector (General OVD detector) that matches image region boxes and category words in a point-to-point manner. As shown in the detection example in the lower left corner of fig. 1, the box prediction accuracy for "umbrella" is low, and the model cannot understand the relative positional relationship of "person", "train" and "umbrella" in the sentence.
Meanwhile, the target detection task needs to identify the accurate positions and sizes of the instances of every category across the whole image; considering context information excessively ignores the independence and diversity of the instances, so that the model outputs an image-level large frame that greedily covers the global context.
As shown in fig. 1, training the model with a general image-text feature fusion module (Transformer Fusion model) and the native masked language modeling (Vanilla Mask Language Modeling, Vanilla MLM) loss biases the model as a whole towards large-frame predictions and makes the prediction results redundant (Redundant content).
For this purpose, the application first provides a target detection method built on a new open-vocabulary target detection framework, DCK-OVD (Diverse Contextual Knowledge for Open Vocabulary Detection, an open-set detection algorithm based on differentiated image-text context information). The target detection method provided by the embodiment of the application can overcome the above defects and explores a way to effectively introduce context information into open-set detection, thereby compensating for the noise that affects point-to-point, instance-level image-text relation learning and improving performance on the OVD task.
Referring to fig. 1, in the embodiment of the present application, a context feature fusion module is added after the conventional OVD detector; instance region features and a text description in which a certain instance word has been masked are input into the context feature fusion module for context information interaction, so as to predict the category corresponding to the masked word, supervised by a detection-oriented MLM loss (Detection-oriented Mask Language Modeling). If training were performed based on this loss alone, the context feature fusion module would predict the masked words using image-text context information; without further constraint, model learning would tend to predict all masked words using the global information of an image-level large frame, without attending to the differences between individual instance boxes, which runs contrary to the detection task. Therefore, the embodiment of the application also introduces a target diversity loss into the MLM learning, so that the MLM becomes better suited to detection tasks and the context feature fusion module attends to different context information for different masked words.
Fig. 2 shows a schematic diagram of an exemplary system architecture to which the technical solution of an embodiment of the present application may be applied. As shown in fig. 2, the system architecture 200 includes a vehicle 210, a cloud 220, a graphic database 230 and a target detection database 240, wherein the graphic database 230 stores graphic data sets, which include a plurality of image-text description pairs, the target detection database 240 stores detection data sets for training target detection capabilities of models, and communication can be performed between the vehicle 210 and the cloud 220, and between the graphic database 230 and the target detection database 240 and the cloud 220, wherein the vehicle 210 specifically includes a camera 211 for capturing an environmental image around the vehicle 210, and a training frame to be trained is provided in the cloud 220. When the object detection method provided by the embodiment of the present application is applied to the system architecture shown in fig. 2, one process may be as follows: first, the cloud 220 acquires a graphic data set from the graphic database 230 and acquires a detection data set from the target detection database 240; then, the cloud end 220 trains the training frame to be trained by using the image-text data set and the detection data set, and obtains a trained training frame after training is completed; next, the cloud 220 builds a target detection model based on the trained training framework, and deploys the target detection model; next, the vehicle 210 starts and runs, the camera 211 of the vehicle 210 collects an environmental image around the vehicle 210, and uploads the environmental image to the cloud 220 through the communication module of the vehicle 210, the cloud 220 performs target detection on the environmental image by using the target detection model, and feeds back a control instruction to the vehicle 210 in real time according to a target detection result so as to guide auxiliary driving of the vehicle.
In some embodiments of the present application, the data in the set of teletext data in the teletext database and the set of detection data in the target detection database 240 are periodically increased, and the cloud 220 trains the training framework based on the newly increased data.
In some embodiments of the present application, the training of the training frame to be trained by the cloud 220 using the teletext data set and the training of the training frame to be trained by the cloud 220 using the detection data set are performed alternately.
In some embodiments of the present application, the cloud 220 first trains the training frame to be trained using the detection data set, and after the training is finished, trains the training frame to be trained using the graphic data set.
In some embodiments of the present application, the size of the set of teletext data in the teletext database is much larger than the size of the set of detection data in the target detection database 240.
It should be understood that the number of vehicles, cameras on vehicles, and databases in fig. 2 is merely illustrative. Any number of vehicles can be provided, any number of cameras can be arranged at various positions on the vehicles, and a greater number of databases can be arranged for providing data required for training according to the implementation requirements.
It should be noted that fig. 2 shows only one embodiment of the present application. Although the solution of the embodiment of fig. 2 is used in the intelligent driving field of vehicles, in other embodiments of the present application, the solution may also be applied in various other fields, for example, in the automatic sorting scene of articles in the intelligent logistics field, and also in the social entertainment scene; although in the solution of the embodiment of fig. 2, the training and deployment of the model are performed at the cloud end, in other embodiments of the present application, the model may be deployed on various other types of terminal devices, such as a smart phone, a tablet computer, a notebook computer, a wearable device, a desktop computer, a notebook computer, and the like, or the model may be trained on various types of devices, such as a server, a desktop computer, a notebook computer, and a workstation, and the like; although the embodiment of fig. 2 is configured to obtain the teletext data set and the detection data set from the database, in other embodiments of the application, the teletext data set and the detection data set may be obtained from any other device capable of storing data, for example, the teletext data set and the detection data set may be directly built in the cloud. The embodiments of the present application should not be limited in any way, nor should the scope of the application be limited in any way.
It is easy to understand that the target detection method provided by the embodiment of the application is generally executed by a cloud server in the cloud, and accordingly, the target detection device is generally arranged in the cloud server. However, in other embodiments of the present application, the user terminal may also have a similar function to the cloud server, so as to execute the target detection scheme provided by the embodiments of the present application.
Therefore, the embodiment of the application can be applied to a terminal or a server. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, basic cloud computing services such as big data and artificial intelligent platforms. The terminal may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the present application is not limited herein.
The implementation details of the technical scheme of the embodiment of the application are described in detail below:
fig. 3 shows a flow chart of a method of object detection according to one embodiment of the application, which may be performed by various computing and processing capable devices, such as a user terminal including, but not limited to, a cell phone, a computer, a smart voice interaction device, a smart home appliance, a vehicle terminal, an aircraft, a smart watch, etc., or a cloud server. Referring to fig. 3, the target detection method at least includes the following steps:
in step 310, a picture and a text description corresponding to the picture are obtained, and at least one masking is performed on the text description to obtain a masked text description corresponding to each masking, where the masking object of the text description is a masked noun.
This step and the steps following this step relate to training of the target training framework, and in the following, how to obtain the target training framework will be described in detail with reference to fig. 4 to 6.
FIG. 4 shows a schematic diagram of a model training architecture according to one embodiment of the application. Referring to fig. 4, a training architecture used by a training framework of the object detection model is shown, where the training architecture includes a language backbone network 410, a visual backbone network 440, a Context feature fusion module (Context-based Fusion model) 430, an RPN (Region Proposal Network, region recommendation network) 450, a region of interest pooling layer (ROI Pooling Layer) 460, a regression header 480, and a classification header 490, the region of interest pooling layer 460 is connected to the RPN450 and the visual backbone network 440, respectively, and the RPN450 is also connected to the visual backbone network 440. The 6 rounded rectangles shown in fig. 4 are loss functions for training.
FIG. 5 shows a flowchart of steps followed by masking a text description at least once, according to one embodiment of the application. Referring to fig. 5, before masking the text description at least once, the object detection method may include the steps of:
in step 510, a detection dataset comprising a plurality of detection samples is obtained, the detection samples comprising a picture and annotation information corresponding to a target in the picture, the annotation information comprising a true annotation frame and a category.
The picture may be an image or a photograph in various formats such as jpeg and bmp, or may be a video image frame included in a certain video. Typically, the picture may be taken by a camera, but in some cases, the picture may also be automatically generated by a computer device.
The object in the picture can be various objects, such as creatures like people and animals, and various objects like umbrellas and household appliances. A picture may contain one or more objects. The real annotation frame is used for indicating the position of the target, the real annotation frame can be an external rectangle of the target, and the category is used for indicating the classification of the target. Labeling information is typically manually based on experience labeling. For example, a picture is a photograph of a girl with an umbrella, and then the picture can contain two targets of the umbrella and the girl, the category corresponding to the girl is "girl", the true mark frame indicates the position of the girl in the picture, the category corresponding to the umbrella is "umbrella", and the true mark frame indicates the position of the umbrella in the picture.
In step 520, a set of teletext data comprising a plurality of picture-text description pairs is obtained, the picture-text description pairs comprising a picture and a text description corresponding to the picture.
The picture and the text description corresponding to the picture in step 310 may be obtained from the set of teletext data in this step, i.e. the picture and the text description corresponding to the picture obtained in step 310 are one or more of a plurality of picture-text description pairs of the set of teletext data in this step; of course, the picture and the text description corresponding to the picture in step 310 may be obtained from other data sets.
The text description includes a plurality of words, and the text description may be a sentence, a piece of text, or even an article. The text description corresponding to the picture is text information related to the picture. For example, a picture may be an illustration of an article, and a textual description corresponding to the picture may be a title of the article or body content of the article.
The number of picture-text description pairs in a teletext data set is typically greater than or even much greater than the number of detection samples in a detection data set.
In step 530, an original target detection model is trained based on the detection dataset and the teletext dataset, wherein the target detection model is obtained by training the trained original target detection model.
The original target detection model is an original training frame, and the original training frame is subjected to preliminary training to obtain the original training frame after preliminary training.
The object detection model, i.e., the model trained in steps 310-340, will be described in detail below.
FIG. 6 illustrates a flow chart for training an original object detection model based on a detection dataset and a teletext dataset, according to an embodiment of the application. As shown in fig. 6, training the original target detection model based on the detection data set and the graphic data set may specifically include the following steps:
in step 610, the picture-text descriptions in the teletext data sets are input in batches into the original target detection model to extract a number of regional feature vectors for the pictures in each batch of picture-text description pairs and a number of word embedding vectors for the text descriptions in each batch of picture-text description pairs.
With continued reference to fig. 4, the language backbone network 410, visual backbone network 440, RPN450, region of interest pooling layer 460, regression header 480, and classification header 490 in the training architecture belong to the original training framework, i.e., the original target detection model.
Each picture-text description pair is input into the original object detection model: the picture in the picture-text description pair is input into the visual backbone network 440, and a plurality of region feature vectors 470 of the picture can be extracted through the visual backbone network 440, the RPN 450 and the region of interest pooling layer 460; the text description in the picture-text description pair is input into the language backbone network 410, and the word embedding vector 420 corresponding to each word in the text description can be extracted through the language backbone network 410. Specifically, the visual backbone network 440, which may be a ResNet, extracts a corresponding feature map from the picture and inputs the feature map to the RPN 450. The RPN 450 outputs corresponding candidate boxes (proposals) based on the input feature map. Specifically, the RPN 450 generally includes two branches: one branch classifies the generated anchor boxes, determining whether each anchor box represents background (the anchor box does not contain a target) or foreground (the anchor box contains a target), and the other branch predicts the boundary regression parameters corresponding to each anchor. The RPN finally selects a plurality of anchors from all anchors according to their classification results and adjusts the position and size of each selected anchor according to the boundary regression parameters, so as to obtain a plurality of candidate boxes (proposals) at the original image scale, each candidate box being a region of the original image. The region of interest pooling layer 460 performs two quantization operations according to each candidate box to obtain the mapping result of the target size corresponding to the candidate box in the feature map, i.e., the corresponding region feature vector 470. The language backbone network 410 may employ a pre-trained BERT (Bidirectional Encoder Representation from Transformers) model; each word in the text description is tokenized and superimposed with a corresponding position embedding vector before the text description is input into the language backbone network 410.
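For illustration, the two-branch extraction described above can be sketched as follows. This is a simplified, hypothetical example: the toy modules stand in for the ResNet visual backbone, the RPN and the pre-trained BERT language backbone, and only torchvision's roi_align is a real library call; it is not the patent's implementation.

    import torch
    import torch.nn as nn
    from torchvision.ops import roi_align

    class ToyVisualBackbone(nn.Module):
        """Stand-in for the visual backbone network 440 (e.g. a ResNet): image -> feature map."""
        def __init__(self, d=256):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(3, d, kernel_size=3, stride=4, padding=1), nn.ReLU(),
                nn.Conv2d(d, d, kernel_size=3, stride=4, padding=1), nn.ReLU())
        def forward(self, images):             # (B, 3, H, W) -> (B, d, H/16, W/16)
            return self.conv(images)

    class ToyLanguageBackbone(nn.Module):
        """Stand-in for the language backbone network 410 (e.g. a pre-trained BERT)."""
        def __init__(self, vocab_size=30522, d=256):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d)
            layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        def forward(self, token_ids):           # (B, n_T) -> (B, n_T, d) word embedding vectors w_j
            return self.encoder(self.embed(token_ids))

    def region_feature_vectors(feature_map, proposals, d=256, pooled=7):
        """RoI-pool candidate boxes into fixed-size region feature vectors r_i.
        In the real pipeline the proposals come from the RPN 450; here they are given."""
        pooled_feats = roi_align(feature_map, proposals, output_size=pooled, spatial_scale=1 / 16)
        proj = nn.Linear(d * pooled * pooled, d)
        return proj(pooled_feats.flatten(1))    # (total_boxes, d)

    # Toy usage with random data
    images = torch.randn(2, 3, 256, 256)
    token_ids = torch.randint(0, 30522, (2, 12))
    fmap = ToyVisualBackbone()(images)
    proposals = [torch.tensor([[0.0, 0.0, 128.0, 128.0], [32.0, 32.0, 200.0, 220.0]])] * 2
    r = region_feature_vectors(fmap, proposals)   # region feature vectors 470
    w = ToyLanguageBackbone()(token_ids)          # word embedding vectors 420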
A batch includes a plurality of picture-text description pairs, for each of which a number of region feature vectors and a number of word embedding vectors can be extracted.
In step 620, a graphic description contrast penalty is determined from each region feature vector and each word embedding vector corresponding to the pair of picture-text descriptions in the same batch, the graphic description contrast penalty being used to learn the mapping relationship between the picture and the text description.
The graphic description contrast loss can align the regional feature vector and the word embedding vector, so that the model can learn the graphic mapping relation at the instance level.
In one embodiment of the present application, the graphic description contrast loss includes a matching loss of the pictures and the text descriptions in the same batch and a matching loss of the text descriptions and the pictures in the same batch, and the graphic description contrast loss is calculated according to the overall similarity between the pictures and the text descriptions.
The graphic description contrast loss, i.e., loss 4 shown in fig. 4, is computed from the following quantities:
I represents a picture, T represents a text description, r_i represents the region feature vector of the i-th region of a picture, w_j represents the word embedding vector corresponding to the j-th word in a text description (e.g., "a young lady under an umbrella by a train"), n_I is the number of region feature vectors corresponding to a single picture, n_T is the number of word embedding vectors of the text description, <·,·> denotes the cosine similarity between a region feature vector and a word embedding vector (which may be computed as the dot product of the two vectors), B_I is a batch of pictures and B_T is a batch of text descriptions. Thus a_ij measures the similarity of the j-th word to the i-th region, <R_I, W_T>_S is the overall similarity between the picture and the text description, and the loss combines two matching terms: the matching loss of the pictures with the text descriptions in the same batch (i.e., the loss of each picture correctly matching its text description) and the matching loss of the text descriptions with the pictures in the same batch (i.e., the loss of each text description correctly matching its picture).
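These quantities can be assembled into a contrastive objective; a sketch in LaTeX, assuming a FILIP-style softmax-weighted word-to-region similarity and a symmetric InfoNCE loss with temperature \tau (the weighting scheme, the temperature and the symbol names are assumptions):

    a_{ij} = \frac{\exp(\langle r_i, w_j \rangle)}{\sum_{k=1}^{n_I} \exp(\langle r_k, w_j \rangle)}, \qquad
    \langle R_I, W_T \rangle_S = \frac{1}{n_T} \sum_{j=1}^{n_T} \sum_{i=1}^{n_I} a_{ij} \, \langle r_i, w_j \rangle

    \mathcal{L}_{I \to T} = -\frac{1}{|B_I|} \sum_{I \in B_I} \log
      \frac{\exp(\langle R_I, W_{T(I)} \rangle_S / \tau)}{\sum_{T \in B_T} \exp(\langle R_I, W_T \rangle_S / \tau)}, \qquad
    \mathcal{L}_{T \to I} = -\frac{1}{|B_T|} \sum_{T \in B_T} \log
      \frac{\exp(\langle R_{I(T)}, W_T \rangle_S / \tau)}{\sum_{I \in B_I} \exp(\langle R_I, W_T \rangle_S / \tau)}

    \mathcal{L}_{cont} = \tfrac{1}{2} \left( \mathcal{L}_{I \to T} + \mathcal{L}_{T \to I} \right)

where T(I) is the text description paired with picture I and I(T) is the picture paired with text description T.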
In step 630, each picture in the detection dataset is input into the original target detection model, the region recommendation loss is determined according to the predicted anchor frame and the real labeling frame for generating the region, the detection classification loss is determined according to the output classification result and the corresponding category, the detection frame regression loss is determined according to the output detection frame prediction result and the corresponding real labeling frame, the region recommendation loss is used for measuring the accuracy of the predicted anchor frame, the detection classification loss is used for measuring the accuracy of the output classification result, and the detection frame regression loss is used for measuring the accuracy of the output detection frame prediction result.
The region recommendation loss, namely loss 1 shown in fig. 4, is used to measure the difference in position between the anchor boxes generated by the RPN 450 and the real annotation boxes; the detection classification loss, namely loss 2 shown in fig. 4, is used to measure the difference between the classification result output by the classification head 490 and the category in the annotation information; the detection frame regression loss, namely loss 3 shown in fig. 4, is used to measure the difference in position between the detection box predictions output by the regression head 480 and the real annotation boxes.
In step 640, the original target detection model is trained based on the teletext contrast loss, the region recommendation loss, the detection classification loss, and the detection frame regression loss.
The total loss used in this stage for training the original target detection model combines the region recommendation loss, the detection classification loss, the detection frame regression loss and the graphic description contrast loss.
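A sketch of this stage-one objective, assuming the four terms are simply summed without weighting coefficients (an assumption), with illustrative symbol names:

    \mathcal{L}_{total} = \mathcal{L}_{rpn} + \mathcal{L}_{det\text{-}cls} + \mathcal{L}_{det\text{-}reg} + \mathcal{L}_{cont}

where \mathcal{L}_{rpn} is the region recommendation loss, \mathcal{L}_{det\text{-}cls} the detection classification loss, \mathcal{L}_{det\text{-}reg} the detection frame regression loss and \mathcal{L}_{cont} the graphic description contrast loss.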
Next, returning to step 310, the training is performed to obtain the target detection model based on the trained original target detection model (i.e., the original training frame after the preliminary training). Specifically, a target training frame is required to be constructed according to the original training frame after preliminary training, and then a target detection model is constructed according to the target training frame after training. Since the initial training frame after the initial training includes the language backbone network 410, the visual backbone network 440, the RPN450, the region of interest pooling layer 460, the regression header 480, and the classification header 490 shown in fig. 4, the context feature fusion module 430 including a plurality of multi-headed attention layers is added after the region of interest pooling layer 460 and the language backbone network 410 on the basis of the initial training frame after the initial training, thereby obtaining the target training frame.
Because the graphic data set lacks the real corresponding relation between the region frame and the noun vocabulary, the corresponding relation only by using model reasoning contains larger mapping noise, which is not beneficial to accurately learning the graphic mapping relation. Thus, embodiments of the present application incorporate a context feature fusion module 430 to correct noisy mappings.
The text description includes a plurality of words, among which one or more nouns may exist; these nouns may refer to corresponding objects in the picture. The masked object of the text description is a masked noun, i.e., a word in the text description that refers to a target. Each time the text description is masked, one or more nouns in it may be masked, resulting in a masked text description.
FIG. 7 illustrates an architecture diagram for training with the introduction of a contextual feature fusion module according to one embodiment of the present application. Referring to fig. 4 and 7, masking is performed by replacing one or more nouns in the text description with a [MASK] tag, resulting in a masked text description. Masked nouns in fig. 7 include girl, forest, and umbrella.
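As a simple illustration of this masking step (a hypothetical helper, not code from the patent), a caption can be masked once or several times by replacing selected nouns with a [MASK] tag:

    def mask_nouns(tokens: list[str], noun_indices: list[int]) -> list[str]:
        """Return a masked copy of the caption, e.g. masking index 2 of
        ["a", "young", "lady", ...] replaces "lady" with "[MASK]"."""
        return ["[MASK]" if i in noun_indices else tok for i, tok in enumerate(tokens)]

    caption = "a young lady under an umbrella by a train".split()
    masked_once = mask_nouns(caption, [2])    # masks the noun "lady"
    masked_again = mask_nouns(caption, [5])   # a second masking pass masks "umbrella"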
In step 320, region feature vectors corresponding to the regions of the picture are extracted, respectively, and a set of word embedding vectors corresponding to the post-mask text descriptions are extracted, respectively.
Region feature vectors are extracted through visual backbone 440, RPN450, and region of interest pooling layer 460, and a set of word embedding vectors corresponding to each masked text description is extracted through language backbone 410, which may include a plurality of word embedding vectors.
In step 330, a multi-headed self-attention fusion operation is performed according to at least a portion of the feature vectors of each region and the embedded vectors of each group of words to predict the predicted features of the masked nouns.
The context feature fusion module 430 may specifically include multiple Transformer layers based on a multi-head attention mechanism; for example, it may include 6 Transformer layers, each using an 8-head attention mechanism.
The context feature fusion module 430 may perform a multi-headed self-attention fusion operation on at least a portion of each regional feature vector and a set of word-embedded vectors, and output the word-embedded vectors of the predicted words, i.e., the predicted features of the masked nouns.
FIG. 8 shows a flowchart of the details of step 330 in the embodiment of FIG. 3, according to one embodiment of the application. Referring to fig. 8, the multi-head self-attention fusion operation is performed according to at least a part of the feature vectors of each region and each word embedding vector, and may specifically include the following steps:
In step 331, a predetermined number of regional feature vectors are selected from the plurality of regional feature vectors according to the similarity between the word embedding vector of the masked noun and each regional feature vector.
The similarity between the word embedding vector of the masked noun and each region feature vector can be calculated according to a_ij in the previous embodiment. The predetermined number of region feature vectors with the largest similarity can then be screened out. The predetermined number may be set as desired, for example, to 100.
In step 332, a multi-headed self-attention fusion operation is performed based on each of the selected regional feature vectors and each of the sets of word embedding vectors.
What is finally input to the context feature fusion module 430 (comprising 6 Transformer layers based on the 8-head attention mechanism) is the set of screened region feature vectors together with the word embedding vectors of the masked text description, where r_i represents the region feature vector of the i-th screened region of the picture and w_j represents the word embedding vector corresponding to the j-th word in the text description.
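A minimal sketch of the screening and fusion described above, under the assumption that a standard PyTorch Transformer encoder can stand in for the context feature fusion module 430 (only its 6 layers and 8 heads are taken from the description; everything else is illustrative):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def select_top_regions(region_feats, mask_word_emb, k=100):
        """region_feats: (n_I, d); mask_word_emb: (d,). Returns the k regions whose
        cosine similarity to the masked noun's word embedding is largest."""
        sims = F.cosine_similarity(region_feats, mask_word_emb.unsqueeze(0), dim=-1)
        top = sims.topk(min(k, region_feats.size(0))).indices
        return region_feats[top]

    class ContextFeatureFusion(nn.Module):
        """Multi-head self-attention fusion over the joint region/word sequence."""
        def __init__(self, d=256, heads=8, layers=6):
            super().__init__()
            layer = nn.TransformerEncoderLayer(d_model=d, nhead=heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        def forward(self, selected_regions, word_embs):
            # (1, k + n_T, d): joint image-text sequence for self-attention fusion
            seq = torch.cat([selected_regions, word_embs], dim=0).unsqueeze(0)
            return self.encoder(seq).squeeze(0)

    d, n_I, n_T = 256, 300, 9
    regions, words = torch.randn(n_I, d), torch.randn(n_T, d)
    mask_pos = 2                                   # position of the [MASK] token
    picked = select_top_regions(regions, words[mask_pos], k=100)
    fused = ContextFeatureFusion(d)(picked, words)
    predicted_mask_feature = fused[picked.size(0) + mask_pos]   # predicted feature of the masked noun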
In step 340, text mask loss is determined based on the predicted features and masked nouns, and a target detection model is obtained based on the text mask loss training, where the text mask loss is used to measure the difference between the predicted features and masked nouns.
The text mask loss, i.e., loss 6 shown in fig. 4 (the MLM loss in fig. 7), may be a cross-entropy loss computed between the predicted feature output by the context feature fusion module 430 for each masked noun and the word embedding vector w_j corresponding to that masked noun.
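A sketch of this cross-entropy in LaTeX, under the assumption that the masked noun is predicted by comparing the predicted feature against candidate word embeddings over the vocabulary:

    \mathcal{L}_{MLM} = -\sum_{j \in \mathcal{M}} \log
      \frac{\exp(\langle \hat{w}_j, w_j \rangle)}{\sum_{v \in \mathcal{V}} \exp(\langle \hat{w}_j, w_v \rangle)}

where \mathcal{M} is the set of masked positions, \hat{w}_j is the predicted feature for position j, w_j is the word embedding vector of the ground-truth masked noun and \mathcal{V} is the vocabulary.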
Through the context feature fusion module 430 and the text mask loss, instance-level visual-text correspondence can be learned more accurately.
FIG. 9 shows a flowchart of the details of steps 330 and 340 in the embodiment of FIG. 3, according to one embodiment of the application. Referring to fig. 9, the multi-head self-attention fusion operation is performed according to at least a part of feature vectors of each region and each word embedding vector to predict and obtain the predicted feature of the masked noun, which may specifically include the following steps:
in step 330', a multi-headed self-attention fusion operation is performed according to at least a portion of each regional feature vector and each set of word embedding vectors to predict the predicted features of the masked nouns, and an attention activation value between the predicted features of each masked noun and each regional feature vector is determined.
In the last Transformer layer of the context feature fusion module, the attention activation value between the predicted feature of a masked noun and each regional feature vector is calculated from the following quantities:
W_q is the query matrix learned in the Transformer layer, W_k is the key matrix learned in the Transformer layer, the predicted feature is the output of the context feature fusion module 430 for the masked noun, r_i is a regional feature vector, and the result is the attention activation value between the predicted feature of the masked noun and the regional feature vector.
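A sketch of this computation, assuming standard scaled dot-product attention with feature dimension d (the scaling and softmax normalisation are assumptions):

    \hat{a}_{ij} = \operatorname{softmax}_i \left( \frac{(W_q \hat{w}_j)^{\top} (W_k r_i)}{\sqrt{d}} \right)

where \hat{w}_j is the predicted feature of the j-th masked noun, r_i is the i-th regional feature vector and \hat{a}_{ij} is the attention activation value.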
Training according to the text mask loss to obtain a target detection model, wherein the method specifically comprises the following steps of:
in step 340', a target variability loss is determined according to the attention activation value, and a target detection model is obtained according to the target variability loss and the text mask loss training, wherein the target variability loss is used for enhancing the attention of the word embedding vector to a target region matched with the word embedding vector and weakening the attention to other regions except the target region in the multiple regions.
The target diversity loss, i.e., loss 5 shown in fig. 4 (the object divergence constraint in fig. 7), is calculated from the attention activation values, where [·]_+ denotes taking the non-negative part and w_j and w_k are the word embedding vectors corresponding to two different masked nouns.
As can be seen from fig. 7, the attention activation value between a masked noun and the region it matches is significantly larger, while the attention activation value between a masked noun and regions it does not match is smaller.
Because the attention activation values of different masked nouns on the same region should differ greatly, the loss takes the difference between the attention activation values of two different masked nouns on the same region and subtracts that difference from 0.5: when the two values differ greatly, a small target diversity loss is obtained; when the attention activation values of different masked nouns on the same region are similar, their difference is small, the target diversity loss approaches 0.5, and a large target diversity loss is obtained, indicating that further optimization is needed.
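Consistent with this description, a sketch of one plausible form of the target diversity loss, assuming a hinge over pairs of masked nouns with margin 0.5 (the summation and margin placement are assumptions):

    \mathcal{L}_{div} = \sum_{j \neq k} \sum_{i} \left[ \, 0.5 - \left| \hat{a}_{ij} - \hat{a}_{ik} \right| \, \right]_+

where the sum runs over pairs of different masked nouns (j, k) and over regions i; each term is small when the two nouns attend to clearly different regions and approaches 0.5 when their attention on the same region is similar.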
Through the target diversity loss, the context feature fusion module is able to attend to different context information for different masked words.
Finally, at this stage, the total loss introduced by the context feature fusion module 430 is:

$$\mathcal{L}_{fusion}=\mathcal{L}_{mask}+\mathcal{L}_{div}$$

where $\mathcal{L}_{fusion}$ is the total loss, i.e., the feature fusion loss based on the object divergence constraint and the MLM (Masked Language Modeling) loss, $\mathcal{L}_{mask}$ is the text mask loss, and $\mathcal{L}_{div}$ is the target diversity loss.
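As a minimal illustration of combining the two terms, the sketch below assumes the text mask loss is computed as a token-level cross-entropy over the masked noun positions and reuses the object_divergence_loss sketch above; both the formulation of the mask loss and all names are assumptions.

```python
import torch.nn.functional as F

def fusion_module_loss(mlm_logits, target_token_ids, attn):
    """Total loss of the context feature fusion module: text mask (MLM) loss + target diversity loss.

    mlm_logits:       (num_masked, vocab_size) predictions for the masked noun tokens
    target_token_ids: (num_masked,) ground-truth token ids of the masked nouns
    attn:             (num_masked, num_regions) attention activation values
    """
    mask_loss = F.cross_entropy(mlm_logits, target_token_ids)  # assumed cross-entropy MLM term
    div_loss = object_divergence_loss(attn)                    # sketch defined above
    return mask_loss + div_loss
```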
After training of the target training framework is completed, the language backbone network 410 and the context feature fusion module 430 contained therein may be removed to obtain a target detection model.
In step 350, target detection is performed on the picture to be detected based on the target detection model.
The picture to be detected is input into the target detection model, and the detection result of the picture to be detected output by the target detection model is obtained, where the detection result includes a detection frame and the category of the target located in the detection frame, the category belonging to the categories to be detected.
In one embodiment of the application, the target detection model outputs the category of the target through a classification head, the classification head being a fixed first text embedding matrix, the first text embedding matrix being obtained by pre-training.
The first text embedding matrix may include embedding vectors corresponding to a plurality of categories. The names of the categories can be converted into corresponding embedding vectors through a pre-trained text encoder, so as to form the first text embedding matrix; alternatively, the names of the categories can first be converted into sentences containing the category names, and the sentences are then converted into the corresponding embedding vectors through the text encoder.
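As a sketch of how such a first text embedding matrix might be constructed, the snippet below embeds one prompt sentence per category name with a CLIP-style pre-trained text encoder; the prompt template, the encode_text interface, and the L2 normalization are illustrative assumptions rather than requirements of the embodiment.

```python
import torch

@torch.no_grad()
def build_text_embedding_matrix(category_names, text_encoder, tokenizer,
                                template="a photo of a {}"):
    """Build a fixed classification head (text embedding matrix) from category names.

    text_encoder: a pre-trained text encoder exposing encode_text(tokens) -> (N, d)
    tokenizer:    the matching tokenizer mapping a list of strings to token tensors
    Returns an (num_categories, d) matrix of L2-normalized embedding vectors.
    """
    sentences = [template.format(name) for name in category_names]  # category name -> sentence
    tokens = tokenizer(sentences)
    embeds = text_encoder.encode_text(tokens)                       # sentence -> embedding vector
    return embeds / embeds.norm(dim=-1, keepdim=True)               # normalize for cosine similarity
```

Replacing the first text embedding matrix with a second one for a different set of categories to be detected then amounts to calling the same routine with the new category names.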
In one embodiment of the present application, the object detection of the picture to be detected based on the object detection model is performed in a case where the first text embedding matrix matches the category to be detected.
That is, if targets of certain categories need to be detected from the picture to be detected, the classification head of the target detection model needs to match those categories to be detected.
In one embodiment of the present application, the object detection method further includes: and under the condition that the first text embedding matrix is not matched with the category to be detected, replacing the first text embedding matrix of the target detection model with a second text embedding matrix matched with the category to be detected so as to carry out target detection based on the new target detection model.
That is, the text embedding matrix is set according to the categories to be detected: whichever categories need to be detected, the corresponding text embedding matrix is set.
In one embodiment of the application, the categories that need to be detected are different from the categories in the annotation information of the detection dataset.
In the embodiment of the application, the model is trained by combining the image-text data set, so that the object detection model can detect objects in other categories except the category of the detection data set.
The category of the target may be predicted based on the first text embedding matrix as follows: the cosine similarity between each regional feature vector of the picture to be detected and each embedding vector in the first text embedding matrix is determined, and the category corresponding to the embedding vector with the largest cosine similarity to a regional feature vector is taken as the category of the target in the region corresponding to that regional feature vector.
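A minimal sketch of this cosine-similarity classification is given below; the L2 normalization and all names and shapes are assumptions of the example.

```python
import torch

def classify_regions(region_feats: torch.Tensor,
                     text_embed_matrix: torch.Tensor,
                     category_names: list[str]) -> list[str]:
    """Assign a category to each region by cosine similarity with the text embeddings.

    region_feats:      (num_regions, d) regional feature vectors of the picture to be detected
    text_embed_matrix: (num_categories, d) first text embedding matrix (classification head)
    """
    r = region_feats / region_feats.norm(dim=-1, keepdim=True)
    t = text_embed_matrix / text_embed_matrix.norm(dim=-1, keepdim=True)
    sim = r @ t.t()               # cosine similarity, (num_regions, num_categories)
    best = sim.argmax(dim=-1)     # index of the most similar category per region
    return [category_names[i] for i in best.tolist()]
```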
Table 1 compares the accuracy of the detector trained based on the method provided by the embodiment of the application with that of other mainstream OVD methods on the MS COCO and LVIS data sets:
TABLE 1
Here, Supervised (Base) refers to a detector trained only with detection data of the base classes that have box labels. Novel refers to the metric on new categories without box labels, and All refers to the overall metric over the base categories and the new categories. Model accuracy is typically evaluated using AP50 on COCO, while LVIS is evaluated using mask mAP. The accuracy of this scheme is shown in the last row of Table 1. As shown in Table 1, this scheme can improve open-set detection accuracy by designing a detection-related MLM training mode.
Therefore, experiments show that, when the same image-text data set and the same target detection base-class data are used, the DCK-OVD provided by the embodiment of the application can obtain better performance on new classes.
In summary, the target detection method provided by the embodiment of the present application proposes a new OVD framework: DCK-OVD. First, regional feature vector-word embedding vector data pairs at the level of instance objects are extracted from the image-text data set, so that the target detection model preliminarily learns the instance-level visual-text correspondence. Then, the regional feature vectors of all instances and the text description in which the word of a certain instance is masked are input into the context feature fusion module at the same time, so that the context feature fusion module predicts which instance word the masked text corresponds to. Through this generative modeling in the fusion module, the model can learn a more accurate instance-level visual-text correspondence from the text context information, the image spatial context information, and the global context correspondence, thereby improving the overall performance and the generalization of the model. Meanwhile, in order to make the model attend to different image regions when different instance words are masked, the embodiment of the application introduces the target diversity loss, which further improves the accuracy of the model. Experiments show that the proposed method effectively improves the detection accuracy of the model on new categories, and the obtained performance is superior to the current most advanced OVD methods. In short, the scheme of the embodiment of the application improves the accuracy of the target detection task and has better detection capability for new categories without detection frame labels.
The following describes an embodiment of the apparatus of the present application, which may be used to perform the object detection method in the above-described embodiment of the present application. For details not disclosed in the embodiments of the apparatus of the present application, please refer to the above-mentioned embodiments of the target detection method of the present application.
Fig. 10 shows a block diagram of an object detection device according to an embodiment of the application.
Referring to fig. 10, an object detection apparatus 1000 according to an embodiment of the present application includes: an acquisition unit 1010, an extraction unit 1020, a self-attention fusion unit 1030, a model training unit 1040, and a target detection unit 1050. The obtaining unit 1010 is configured to obtain a picture and a text description corresponding to the picture, and perform masking on the text description at least once to obtain a masked text description corresponding to each masking, where a masking object of the text description is a masked noun; the extracting unit 1020 is configured to extract region feature vectors corresponding to a plurality of regions of the picture, and extract a set of word embedding vectors corresponding to each of the masked text descriptions; the self-attention fusion unit 1030 is configured to perform a multi-headed self-attention fusion operation according to at least a part of the feature vectors of each region and each word embedding vector group, so as to predict and obtain predicted features of the masked nouns; the model training unit 1040 is configured to determine a text mask loss according to the predicted feature and the masked noun, and obtain a target detection model according to the text mask loss training, where the text mask loss is used to measure a difference between the predicted feature and the masked noun; the target detection unit 1050 is configured to perform target detection on a picture to be detected based on the target detection model.
In some embodiments of the present application, based on the foregoing scheme, the self-attention fusion unit 1030 is configured to: performing multi-head self-attention fusion operation according to at least one part of each regional characteristic vector and each group of word embedding vectors to predict and obtain predicted characteristics of mask nouns, and determining attention activation values between the predicted characteristics of the mask nouns and the regional characteristic vectors; the model training unit 1040 is configured to: and determining target diversity loss according to the attention activation value, and training according to the target diversity loss and the text mask loss to obtain a target detection model, wherein the target diversity loss is used for enhancing the attention of the word embedding vector to a target area matched with the word embedding vector and weakening the attention of other areas except the target area in the plurality of areas.
In some embodiments of the present application, based on the foregoing, the apparatus further includes a detection data set acquisition unit, an image-text data set acquisition unit, and a training unit; before masking the text description at least once, the detection data set acquisition unit is configured to: acquiring a detection data set comprising a plurality of detection samples, wherein the detection samples comprise pictures and labeling information corresponding to targets in the pictures, and the labeling information comprises real labeling frames and categories; the image-text data set acquisition unit is used for: acquiring an image-text data set comprising a plurality of picture-text description pairs, wherein the picture-text description pairs comprise pictures and text descriptions corresponding to the pictures; the training unit is used for: training an original target detection model based on the detection data set and the image-text data set, wherein the target detection model is obtained by training the trained original target detection model.
In some embodiments of the application, based on the foregoing, the training unit is configured to: inputting the picture-text descriptions in the picture-text data set into the original target detection model in batches to extract a plurality of regional feature vectors of the pictures in the picture-text description pairs of each batch and a plurality of word embedding vectors of the text descriptions in the picture-text description pairs of each batch; determining a graphic description contrast loss according to each regional feature vector and each word embedding vector corresponding to the picture-text description pairs in the same batch, wherein the graphic description contrast loss is used for learning the mapping relation between the picture and the text description; respectively inputting each picture in the detection data set into the original target detection model, determining a region recommendation loss according to a predicted anchor frame for generating a region and the real annotation frame, determining a detection classification loss according to an output classification result and the corresponding category, and determining a detection frame regression loss according to an output detection frame prediction result and the corresponding real annotation frame, wherein the region recommendation loss is used for measuring the accuracy of the predicted anchor frame, the detection classification loss is used for measuring the accuracy of the output classification result, and the detection frame regression loss is used for measuring the accuracy of the output detection frame prediction result; and training the original target detection model according to the graphic description contrast loss, the region recommendation loss, the detection classification loss and the detection frame regression loss.
In some embodiments of the application, based on the foregoing, the object detection model outputs the class of the object through a classification head, the classification head being a fixed first text-embedding matrix, the first text-embedding matrix being pre-trained.
In some embodiments of the present application, based on the foregoing solution, the object detection of the picture to be detected based on the object detection model is performed when the first text embedding matrix matches a category to be detected.
In some embodiments of the present application, based on the foregoing scheme, the category to be detected is different from the category in the annotation information of the detection dataset.
In some embodiments of the present application, based on the foregoing scheme, the self-attention fusion unit 1030 is configured to: according to the similarity between the word embedding vector of the mask noun and each regional feature vector, a preset number of regional feature vectors are selected from the regional feature vectors; and performing multi-head self-attention fusion operation according to the screened regional feature vectors and each group of word embedding vectors.
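For illustration, this screening step could be implemented as a simple top-k selection by similarity, as sketched below; the use of cosine similarity and the default value of k are assumptions of the example.

```python
import torch

def select_regions(noun_embed: torch.Tensor,
                   region_feats: torch.Tensor,
                   k: int = 10) -> torch.Tensor:
    """Keep the k regional feature vectors most similar to the masked noun's word embedding.

    noun_embed:   (d,) word embedding vector of the masked noun
    region_feats: (num_regions, d) regional feature vectors
    """
    sim = torch.nn.functional.cosine_similarity(region_feats, noun_embed.unsqueeze(0), dim=-1)
    k = min(k, region_feats.shape[0])
    top_idx = sim.topk(k).indices          # indices of the most similar regions
    return region_feats[top_idx]
```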
In some embodiments of the present application, based on the foregoing solution, the graphic description contrast loss includes a matching loss of the picture and each text description in the same batch and a matching loss of the text description and each picture in the same batch, where the graphic description contrast loss is calculated according to an overall similarity between the picture and the text description.
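A minimal sketch of such a bidirectional picture-to-text and text-to-picture matching loss is given below, assuming an InfoNCE-style cross-entropy over global picture and text-description features within a batch; the temperature value and the use of pooled global features are assumptions of the example.

```python
import torch
import torch.nn.functional as F

def picture_text_contrast_loss(img_feats: torch.Tensor,
                               txt_feats: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Bidirectional contrastive loss over a batch of matched picture/text-description pairs.

    img_feats: (B, d) overall picture features
    txt_feats: (B, d) overall text description features
    Row i of each tensor corresponds to the same picture-text description pair.
    """
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    logits = img @ txt.t() / temperature             # (B, B) similarity of every picture to every text
    targets = torch.arange(img.shape[0], device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # match each picture to its own text description
    loss_t2i = F.cross_entropy(logits.t(), targets)  # match each text description to its own picture
    return (loss_i2t + loss_t2i) / 2
```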
Fig. 11 shows a schematic diagram of a computer system suitable for use in implementing an embodiment of the application.
It should be noted that, the computer system 1100 of the electronic device shown in fig. 11 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present application.
As shown in fig. 11, the computer system 1100 includes a central processing unit (Central Processing Unit, CPU) 1101 that can perform various appropriate actions and processes, such as performing the method described in the above embodiment, according to a program stored in a Read-Only Memory (ROM) 1102 or a program loaded from a storage section 1108 into a random access Memory (Random Access Memory, RAM) 1103. In the RAM 1103, various programs and data required for system operation are also stored. The CPU 1101, ROM 1102, and RAM 1103 are connected to each other by a bus 1104. An Input/Output (I/O) interface 1105 is also connected to bus 1104.
The following components are connected to the I/O interface 1105: an input section 1106 including a keyboard, a mouse, and the like; an output portion 1107 including a Cathode Ray Tube (CRT), a liquid crystal display (Liquid Crystal Display, LCD), and a speaker; a storage section 1108 including a hard disk or the like; and a communication section 1109 including a network interface card such as a LAN (Local Area Network ) card, a modem, or the like. The communication section 1109 performs communication processing via a network such as the internet. The drive 1110 is also connected to the I/O interface 1105 as needed. Removable media 1111, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is installed as needed in drive 1110, so that a computer program read therefrom is installed as needed in storage section 1108.
In particular, according to embodiments of the present application, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program can be downloaded and installed from a network via the communication portion 1109, and/or installed from the removable media 1111. When executed by a Central Processing Unit (CPU) 1101, performs the various functions defined in the system of the present application.
It should be noted that, the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-Only Memory (ROM), an erasable programmable read-Only Memory (Erasable Programmable Read Only Memory, EPROM), flash Memory, an optical fiber, a portable compact disc read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. Where each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present application may be implemented by software, or may be implemented by hardware, and the described units may also be provided in a processor. Wherein the names of the units do not constitute a limitation of the units themselves in some cases.
As an aspect, the present application also provides a computer-readable medium that may be contained in the electronic device described in the above embodiment; or may exist alone without being incorporated into the electronic device. The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to implement the methods described in the above embodiments.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functions of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the application. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, and includes several instructions to cause a computing device (may be a personal computer, a server, a touch terminal, or a network device, etc.) to perform the method according to the embodiments of the present application.
It will be appreciated that in particular embodiments of the present application, where data relating to training and reasoning about target detection models is involved, user approval or consent is required when the above embodiments of the present application are applied to particular products or technologies, and the collection, use and processing of the relevant data is required to comply with relevant laws and regulations and standards of the relevant country and region.
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the embodiments disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains.
It is to be understood that the application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (13)

1. A method of target detection, the method comprising:
acquiring a picture and a text description corresponding to the picture, masking the text description at least once to obtain a masked text description corresponding to each masking, wherein mask objects of the text description are masked nouns;
Extracting regional feature vectors corresponding to a plurality of regions of the picture respectively, and extracting a group of word embedding vectors corresponding to each of the masked text descriptions respectively;
performing multi-head self-attention fusion operation according to at least one part of each regional feature vector and each group of word embedding vectors so as to predict and obtain predicted features of mask nouns;
determining text mask loss according to the predicted features and the masked nouns, and training according to the text mask loss to obtain a target detection model, wherein the text mask loss is used for measuring the difference between the predicted features and the masked nouns;
and carrying out target detection on the picture to be detected based on the target detection model.
2. The method of claim 1, wherein performing a multi-headed self-attention fusion operation based on at least a portion of each regional feature vector and each word-embedding vector to predict a predicted feature of a masked noun comprises:
performing multi-head self-attention fusion operation according to at least one part of each regional characteristic vector and each group of word embedding vectors to predict and obtain predicted characteristics of mask nouns, and determining attention activation values between the predicted characteristics of the mask nouns and the regional characteristic vectors;
The training according to the text mask loss to obtain a target detection model comprises the following steps:
and determining target diversity loss according to the attention activation value, and training according to the target diversity loss and the text mask loss to obtain a target detection model, wherein the target diversity loss is used for enhancing the attention of the word embedding vector to a target area matched with the word embedding vector and weakening the attention of other areas except the target area in the plurality of areas.
3. The object detection method according to claim 1, wherein before masking the text description at least once, the method further comprises:
acquiring a detection data set comprising a plurality of detection samples, wherein the detection samples comprise pictures and labeling information corresponding to targets in the pictures, and the labeling information comprises real labeling frames and categories;
acquiring an image-text data set comprising a plurality of image-text description pairs, wherein the image-text description pairs comprise images and text descriptions corresponding to the images;
and training an original target detection model based on the detection data set and the image-text data set, wherein the target detection model is obtained by training the trained original target detection model.
4. A method of object detection as claimed in claim 3, wherein said training an original object detection model based on said detection data set and said image-text data set comprises:
inputting the picture-text description in the picture-text data set into the original target detection model in batches to extract a plurality of regional feature vectors of pictures in the picture-text description pair of each batch and a plurality of word embedding vectors of the text description in the picture-text description pair of each batch;
determining a graphic description contrast loss according to each regional feature vector and each word embedding vector corresponding to the picture-text description pairs in the same batch, wherein the graphic description contrast loss is used for learning the mapping relation between the picture and the text description;
respectively inputting each picture in the detection data set into the original target detection model, determining region recommendation loss according to a predicted anchor frame and a real annotation frame for generating a region, determining detection classification loss according to an output classification result and a corresponding category, and determining detection frame regression loss according to an output detection frame prediction result and a corresponding real annotation frame, wherein the region recommendation loss is used for measuring the accuracy of the predicted anchor frame, the detection classification loss is used for measuring the accuracy of the output classification result, and the detection frame regression loss is used for measuring the accuracy of the output detection frame prediction result;
and training the original target detection model according to the graphic description contrast loss, the region recommendation loss, the detection classification loss and the detection frame regression loss.
5. The method of claim 1, wherein the object detection model outputs the class of objects by a classification head, the classification head being a fixed first text-embedding matrix, the first text-embedding matrix being pre-trained.
6. The object detection method according to claim 5, wherein the object detection of the picture to be detected based on the object detection model is performed in a case where the first text embedding matrix matches a category to be detected.
7. The method according to claim 6, wherein the category to be detected is different from the category in the annotation information of the detection dataset.
8. The method according to any one of claims 1 to 7, wherein the performing a multi-headed self-attention fusion operation based on at least a part of the feature vectors of each region and the embedded vectors of each group of words comprises:
according to the similarity between the word embedding vector of the mask noun and each regional feature vector, a preset number of regional feature vectors are selected from the regional feature vectors;
And performing multi-head self-attention fusion operation according to the screened regional feature vectors and each group of word embedding vectors.
9. The method according to claim 4, wherein the graphic description contrast loss includes a loss of matching of the pictures with the text descriptions in the same batch and a loss of matching of the text descriptions with the pictures in the same batch, the graphic description contrast loss being calculated according to an overall similarity between the pictures and the text descriptions.
10. An object detection device, the device comprising:
an acquisition unit, wherein the acquisition unit is used for acquiring a picture and a text description corresponding to the picture, and masking the text description at least once to obtain a masked text description corresponding to each masking, wherein masking objects of the text description are masked nouns;
the extraction unit is used for extracting regional feature vectors respectively corresponding to a plurality of regions of the picture, and extracting a group of word embedding vectors respectively corresponding to each of the masked text descriptions;
the self-attention fusion unit is used for carrying out multi-head self-attention fusion operation according to at least one part of the feature vectors of each region and each group of word embedding vectors so as to predict and obtain the predicted features of the mask nouns;
The model training unit is used for determining a text mask loss according to the predicted features and the masked nouns, and training according to the text mask loss to obtain a target detection model, wherein the text mask loss is used for measuring the difference between the predicted features and the masked nouns;
and the target detection unit is used for carrying out target detection on the picture to be detected based on the target detection model.
11. A computer readable medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the object detection method according to any one of claims 1 to 9.
12. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs which when executed by the one or more processors cause the one or more processors to implement the target detection method of any of claims 1 to 9.
13. A computer program product, characterized in that the computer program product comprises computer instructions stored in a computer-readable storage medium, from which computer instructions a processor of a computer device reads, the processor executing the computer instructions, causing the computer device to perform the object detection method according to any one of claims 1 to 9.
CN202310616089.3A 2023-05-29 2023-05-29 Target detection method, target detection device, computer readable medium and electronic equipment Pending CN116958512A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310616089.3A CN116958512A (en) 2023-05-29 2023-05-29 Target detection method, target detection device, computer readable medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310616089.3A CN116958512A (en) 2023-05-29 2023-05-29 Target detection method, target detection device, computer readable medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN116958512A true CN116958512A (en) 2023-10-27

Family

ID=88445104

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310616089.3A Pending CN116958512A (en) 2023-05-29 2023-05-29 Target detection method, target detection device, computer readable medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN116958512A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117351440A (en) * 2023-12-06 2024-01-05 浙江华是科技股份有限公司 Semi-supervised ship detection method and system based on open text detection
CN117351440B (en) * 2023-12-06 2024-02-20 浙江华是科技股份有限公司 Semi-supervised ship detection method and system based on open text detection

Similar Documents

Publication Publication Date Title
CN109117777B (en) Method and device for generating information
CN111026842B (en) Natural language processing method, natural language processing device and intelligent question-answering system
CN111741330B (en) Video content evaluation method and device, storage medium and computer equipment
CN110851641B (en) Cross-modal retrieval method and device and readable storage medium
CN110674312B (en) Method, device and medium for constructing knowledge graph and electronic equipment
US11856277B2 (en) Method and apparatus for processing video, electronic device, medium and product
CN111931859B (en) Multi-label image recognition method and device
CN116226785A (en) Target object recognition method, multi-mode recognition model training method and device
CN116050496A (en) Determination method and device, medium and equipment of picture description information generation model
CN115393606A (en) Method and system for image recognition
CN116958512A (en) Target detection method, target detection device, computer readable medium and electronic equipment
CN116127080A (en) Method for extracting attribute value of description object and related equipment
CN111898528B (en) Data processing method, device, computer readable medium and electronic equipment
CN116246287B (en) Target object recognition method, training device and storage medium
CN116994021A (en) Image detection method, device, computer readable medium and electronic equipment
CN116304014A (en) Method for training entity type recognition model, entity type recognition method and device
CN112101154B (en) Video classification method, apparatus, computer device and storage medium
CN115563976A (en) Text prediction method, model building method and device for text prediction
CN114299295A (en) Data processing method and related device
CN113407778A (en) Label identification method and device
CN111626315A (en) Model training method, object recognition method, device, medium, and electronic apparatus
CN113011186B (en) Named entity recognition method, named entity recognition device, named entity recognition equipment and computer readable storage medium
CN115565152B (en) Traffic sign extraction method integrating vehicle-mounted laser point cloud and panoramic image
CN117557871B (en) Three-dimensional model labeling method, device, equipment and storage medium
CN116956915A (en) Entity recognition model training method, device, equipment, storage medium and product

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication