CN117115584A - Target detection method, device and server - Google Patents

Target detection method, device and server

Info

Publication number
CN117115584A
Authority
CN
China
Prior art keywords
point
sample
labeling
target
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311061664.4A
Other languages
Chinese (zh)
Inventor
田欣
崔振伟
刘建
苏洪全
郭月明
马庆
李晶晶
冯连威
张丹和
耿刚强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunlun Digital Technology Co ltd
China National Petroleum Corp
Original Assignee
Kunlun Digital Technology Co ltd
China National Petroleum Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunlun Digital Technology Co ltd, China National Petroleum Corp filed Critical Kunlun Digital Technology Co ltd
Priority to CN202311061664.4A
Publication of CN117115584A
Current legal status: Pending

Classifications

    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G06N 3/048 Activation functions
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G06N 3/0895 Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V 10/764 Image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V 10/776 Validation; Performance evaluation
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V 10/82 Image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V 20/54 Surveillance or monitoring of activities of traffic, e.g. cars on the road, trains or boats
    • G06V 20/70 Labelling scene content, e.g. deriving syntactic or semantic representations
    • G06V 2201/07 Target detection
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The specification provides a target detection method, a target detection device and a server. Based on the method, before specific implementation, a target detection model with higher precision and better effect can be trained at lower cost through weak semi-supervised learning, using a large number of point labeling sample images, a small number of full labeling sample images and a sample transformation model comprising at least a point amplification module, according to a weak semi-supervised learning rule based on point labeling; in specific implementation, the acquired target image is processed by the target detection model to accurately determine whether a target object exists in the target image; and, in the case that the target object exists in the target image, the position information and the object type of the target object in the target image are further determined. Thus, target detection on image data can be realized efficiently and accurately at relatively low processing cost.

Description

Target detection method, device and server
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to a target detection method, a device, and a server.
Background
In many application scenarios (e.g., traffic road supervision, residential community security, oilfield construction, etc.), it is often necessary to perform target detection on the images acquired by monitoring, using a corresponding detection model, to determine whether a corresponding safety risk exists.
However, based on existing methods, a large number of full labeling samples must first be obtained at a high labeling cost, and the corresponding detection model is then trained on these full labeling samples before it can be used to perform specific target detection on images. In implementation, this often leads to the technical problems that model training is time-consuming and the training cost is high, which in turn affects the overall processing cost and processing efficiency.
In view of the above problems, no effective solution has been proposed at present.
Disclosure of Invention
The specification provides a target detection method, a device and a server, which can efficiently and accurately realize target detection for image data with relatively low processing cost.
The specification provides a target detection method, comprising the following steps:
acquiring a target image;
processing the target image by using a target detection model to obtain a corresponding target processing result; wherein the target detection model is obtained in advance, according to a weak semi-supervised learning rule based on point labeling, through weak semi-supervised learning training using point labeling sample images, full labeling sample images and a sample transformation model; the sample transformation model comprises at least: a point amplification module;
determining whether a target object exists in the target image according to the target processing result; and determining the position information and the object type of the target object in the target image under the condition that the target object exists in the target image.
In one embodiment, before acquiring the target image, the method further comprises:
acquiring a point labeling sample image and a full labeling sample image according to a weak semi-supervised learning rule based on point labeling; the number of the point labeling sample images is larger than that of the full labeling sample images; the full-labeling sample image at least comprises a labeling frame for selecting a sample object and a type label carried by the labeling frame and aiming at the sample object; the point labeling sample image at least comprises a labeling point used for indicating a sample object and a type label carried by the labeling point and aiming at the sample object;
processing the point labeling sample image by using a sample transformation model to obtain a processed point labeling sample image carrying a pseudo labeling frame;
combining the full-label sample image and the processed point label sample image to obtain a training sample set;
and training the initial detection model by using the training sample set to obtain a target detection model meeting the requirements.
In one embodiment, the sample transformation model further comprises: a feature extraction network, a Transformer module, a point encoder and a prediction head module;
wherein the feature extraction network is connected with the Transformer module; the point amplification module is also connected with the point encoder; the point encoder is also connected with the Transformer module; the Transformer module is also connected with the prediction head module;
the prediction head module is used for determining and labeling, based on the intermediate result data output by the Transformer module, a pseudo labeling frame for framing a sample object in the point labeling sample image input into the sample transformation model, so as to output a corresponding processed point labeling sample image carrying the pseudo labeling frame.
In one embodiment, the feature extraction network is used for extracting features of the point labeling sample image input into the sample transformation model to obtain a corresponding sample feature map; the sample feature map is sent to the Transformer module;
the point amplification module is used for receiving the high-dimensional features of the point labeling sample image output by the feature extraction network; acquiring, in the high-dimensional features of the point labeling sample image, the high-dimensional feature information at each labeling point through bilinear interpolation sampling; determining, through position deviation prediction according to the high-dimensional feature information at the labeling point, four auxiliary points which are associated with the labeling point and located in four different directions; and setting the type label carried by the labeling point on each of the four auxiliary points to obtain the amplified pseudo labeling points.
In one embodiment, the point encoder includes: an improved point encoder; wherein the improved point encoder comprises at least: an absolute position branch processing structure, a relative position branch processing structure, a category index branch processing structure, and a semantic alignment branch processing structure;
the point encoder is used for processing the labeling points in the point labeling sample image and the amplified pseudo labeling points to obtain and output a corresponding joint code.
In one embodiment, the point encoder processes the labeling points in the point labeling sample image and the amplified pseudo labeling points to obtain and output the corresponding joint code in the following manner:
respectively carrying out preset coding processing according to the position coordinates of the marking points and the position coordinates of the amplified pseudo marking points through an absolute position branch processing structure to obtain corresponding absolute position codes;
determining, through the relative position branch processing structure, according to the position coordinates of the labeling points and the position coordinates of the amplified pseudo labeling points, the offset distances of each labeling point and each amplified pseudo labeling point relative to the upper, lower, left and right boundaries of the point labeling sample image, and generating corresponding relative position vectors; converting the relative position vectors into corresponding relative ratio data; and carrying out preset coding processing on the relative ratio data to obtain corresponding relative position codes;
processing the type label of the labeling point and the type labels of the amplified pseudo labeling points through the category index branch processing structure to obtain a corresponding category index code;
respectively carrying out linear mapping on the high-dimensional characteristics of the obtained marking points and the high-dimensional characteristics of the amplified pseudo marking points through a semantic alignment branch processing structure to obtain corresponding image semantic characteristic codes; the image semantic feature codes, the absolute position codes, the relative position codes and the category index codes have the same dimension;
and combining the absolute position code, the relative position code, the category index code and the image semantic feature code to obtain corresponding joint codes.
In one embodiment, the absolute position encoding, the relative position encoding, the category index encoding, and the image semantic feature encoding are used in combination to obtain corresponding joint encodings, including:
performing addition operation by using the absolute position code, the relative position code and the category index code to obtain an intermediate code;
and performing a multiplication operation by using the intermediate code and the image semantic feature code to obtain the joint code.
In one embodiment, the Transformer module comprises: a Transformer encoder and a Transformer decoder; wherein the Transformer encoder is respectively connected with the feature extraction network and the Transformer decoder; the Transformer decoder is also connected with the point encoder and the prediction head module;
the Transformer module is used for generating and outputting corresponding intermediate result data according to the input sample feature map, position code and joint code.
In one embodiment, the Transformer module generates and outputs the corresponding intermediate result data based on the input sample feature map, the position code and the joint code in the following manner:
processing the combination of the sample feature map and the position code by the Transformer encoder to generate corresponding embedded data; and transmitting the embedded data to the Transformer decoder;
obtaining a matched attention mask by the Transformer decoder;
generating corresponding object query data by using the joint code through the Transformer decoder; and generating and outputting corresponding intermediate result data by using the attention mask, the object query data and the embedded data.
In one embodiment, the Transformer decoder comprises: a decoding layer; wherein the decoding layer includes: a self-attention layer, a cross-attention layer, and a feedforward neural network layer;
the self-attention layer is used for processing the object query data by using an attention mask so as to strengthen interaction between the object query data indicating the same sample object;
The cross attention layer is used for refining the processed object query data output by the self attention layer by utilizing the embedded data so as to obtain refined object query data;
and the feedforward neural network layer is used for predicting a bounding box containing the sample object in the point labeling sample image according to the refined object query data to obtain corresponding intermediate result data.
In one embodiment, the sample transformation model is further connected with a matching loss module; the matching loss module is configured with a matching loss function and a denoising loss function.
In one embodiment, generating corresponding object query data using the joint encoding by the Transformer decoder includes:
acquiring noise data based on type tags;
generating corresponding matching data according to the object query data;
combining the noise data and the matching data to obtain the object query data;
accordingly, the matched attention mask includes: a denoising mask region corresponding to the noise data, and a matching mask region corresponding to the matching data.
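A minimal sketch of this composition follows (the mask convention shown, where True marks blocked attention and the matching queries are prevented from attending to the denoising region, is an assumption for the example; the exact mask regions follow the patent's figures):

import torch

def build_queries_and_mask(noise_queries: torch.Tensor,
                           match_queries: torch.Tensor) -> tuple:
    # noise_queries: (num_dn, d) denoising queries derived from type tags with added noise;
    # match_queries: (num_match, d) matching queries.
    # Returns the combined object query data and a boolean attention mask (True = blocked).
    queries = torch.cat([noise_queries, match_queries], dim=0)
    n_dn = noise_queries.shape[0]
    total = queries.shape[0]
    mask = torch.zeros(total, total, dtype=torch.bool)
    # Matching-region queries must not attend to the denoising region, so that
    # information derived from the ground-truth labels does not leak into matching.
    mask[n_dn:, :n_dn] = True
    return queries, mask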
In one embodiment, the sample transformation model is trained as follows:
acquiring and utilizing full labeling sample images according to the weak semi-supervised learning rule based on point labeling, and generating first sample data;
constructing an initial transformation model; wherein the initial transformation model comprises at least an initial point amplification module;
and training the initial transformation model by using the first sample data to obtain a sample transformation model meeting the requirements.
The present specification also provides an object detection apparatus including:
the acquisition module is used for acquiring a target image;
the processing module is used for processing the target image by utilizing a target detection model to obtain a corresponding target processing result; the target detection model is obtained in advance, according to a weak semi-supervised learning rule based on point labeling, through weak semi-supervised learning training using point labeling sample images, full labeling sample images and a sample transformation model; the sample transformation model comprises at least: a point amplification module;
the determining module is used for determining whether a target object exists in the target image according to the target processing result; and determining the position information and the object type of the target object in the target image under the condition that the target object exists in the target image.
The present specification also provides a server comprising a processor and a memory for storing processor-executable instructions which, when executed by the processor, implement the relevant steps of the target detection method.
Based on the target detection method, device and server provided by the specification, before specific implementation, a target detection model with higher precision and better effect can be trained at lower cost through weak semi-supervised learning, using a large number of point labeling sample images, a small number of full labeling sample images and a sample transformation model comprising at least a point amplification module, according to the weak semi-supervised learning rule based on point labeling; in specific implementation, the target image is processed by the target detection model to accurately determine whether a target object exists in the target image; and, in the case that the target object exists in the target image, the position information and the object type of the target object in the target image are further determined. Because the weak semi-supervised learning is performed, according to the weak semi-supervised learning rule based on point labeling, with a sample transformation model comprising at least a point amplification module, only a small number of full labeling sample images need to be acquired and used, and a target detection model with higher precision and better effect can be trained quickly, relying mainly on point labeling sample images whose labeling cost is lower, so that the overall processing cost can be effectively reduced; moreover, the target detection model trained in this manner can be used to realize target detection on image data efficiently and accurately.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure, the drawings that are required for the embodiments will be briefly described below, and the drawings described below are only some embodiments described in the present disclosure, and other drawings may be obtained according to these drawings without inventive effort for a person of ordinary skill in the art.
FIG. 1 is a schematic flow chart of a target detection method according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of one embodiment of an object detection method provided by embodiments of the present disclosure, in one example scenario;
FIG. 3 is a schematic diagram of one embodiment of an object detection method provided by embodiments of the present disclosure, in one example scenario;
FIG. 4 is a schematic view of a composition of a sample modification model according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of an embodiment of an object detection method provided by embodiments of the present disclosure, in one example scenario;
FIG. 6 is a schematic diagram of one embodiment of an object detection method provided by embodiments of the present disclosure, in one example scenario;
FIG. 7 is a schematic diagram of an embodiment of an object detection method provided by embodiments of the present disclosure, in one example scenario;
FIG. 8 is a schematic diagram of another component structure of a sample modification model provided in one embodiment of the present disclosure;
FIG. 9 is a schematic diagram of an embodiment of an object detection method provided by embodiments of the present disclosure, in one example scenario;
FIG. 10 is a schematic diagram of an embodiment of an object detection method provided by embodiments of the present disclosure, in one example scenario;
FIG. 11 is a schematic diagram of an embodiment of an object detection method provided by embodiments of the present disclosure, in one example scenario;
FIG. 12 is a schematic view of still another component structure of the sample modification model provided in one embodiment of the present disclosure;
FIG. 13 is a schematic diagram of an embodiment of an object detection method provided by embodiments of the present disclosure, in one example scenario;
FIG. 14 is a schematic diagram of an embodiment of an object detection method provided by embodiments of the present disclosure, in one example scenario;
FIG. 15 is a schematic diagram showing the structural composition of a server according to an embodiment of the present disclosure;
fig. 16 is a schematic structural diagram of an object detection device according to an embodiment of the present disclosure.
Detailed Description
In order to make the technical solutions in the present specification better understood by those skilled in the art, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only some embodiments of the present specification, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present disclosure.
Referring to fig. 1, an embodiment of the present disclosure provides a target detection method, where the method is specifically applied to a server side. In particular implementations, the method may include the following:
S101: acquiring a target image;
S102: processing the target image by using a target detection model to obtain a corresponding target processing result; the target detection model is obtained in advance, according to a weak semi-supervised learning rule based on point labeling, through weak semi-supervised learning training using point labeling sample images, full labeling sample images and a sample transformation model; the sample transformation model comprises at least: a point amplification module;
S103: determining whether a target object exists in the target image according to the target processing result; and determining the position information and the object type of the target object in the target image under the condition that the target object exists in the target image.
In some embodiments, the above target image may be specifically understood as image data to be detected for the presence of a target object. Specifically, the target image may be a photograph acquired by a camera, or may be an image frame taken from a video. The target object may be a different type of object in different application scenarios.
Specifically, for example, referring to fig. 2, in an oilfield construction scenario, the target image may specifically be a field image collected at the oilfield construction site; accordingly, the target object to be detected may be a risk object posing a potential safety hazard at the oilfield construction site, such as a worker performing an illegal construction operation, or an abnormal fire point, etc.
In the specific implementation, the field image aiming at the oilfield construction site can be acquired in real time or at fixed time by a monitoring camera deployed at the oilfield construction site to serve as a target image to be detected; and then the target image is sent to a server in a wired or wireless mode. The server can perform target detection on the target image by using a target detection model obtained in advance based on weak semi-supervised learning training so as to determine whether a risk object exists in the target image. In the case that the risk object exists in the target image, the server can further determine the position information and the risk type of the risk object; generating corresponding risk prompt information according to the position information and the risk type of the risk object; and then the risk prompt information is sent to a user terminal held by a safety inspection personnel at the oil field construction site for prompt. Correspondingly, the safety inspection personnel can go to the corresponding position in time to confirm the risk according to the risk prompt information, and perform targeted treatment according to the risk type, so that potential safety hazards are eliminated in time, and the construction safety of the oilfield construction site is ensured.
The server may specifically comprise a background server which is deployed on the side of the oilfield construction safety management center and can realize functions such as data transmission and data processing. Specifically, the server may be, for example, an electronic device having data computation, storage and network interaction functions. Alternatively, the server may be a software program running in an electronic device that provides support for data processing, storage and network interaction. In the present embodiment, the number of servers is not particularly limited. The server may be one server, several servers, or a server cluster formed by several servers.
The user terminal may specifically comprise a front end which is used on the side of the safety inspection personnel and can realize functions such as data acquisition and data transmission. Specifically, the user terminal may be, for example, an electronic device such as a tablet computer, a notebook computer, or a smart phone. Alternatively, the user terminal may be a software application capable of running in an electronic device, for example, a safety management center client APP running on a smart phone.
For another example, in the traffic road supervision scenario, the target image may specifically be a vehicle image that travels on a road collected by a traffic road monitoring system; correspondingly, the object to be detected may specifically be a vehicle object running on a road with a driving behavior at risk of violating regulations.
Of course, it should be noted that the above-listed application scenarios, the target image and the target object are only illustrative. In the implementation, the target detection method can be applied to other application scenes according to specific conditions and requirements, and the related target images and target objects can also comprise target images and target objects in other application scenes. The present specification is not limited to this.
In some embodiments, the target detection model may specifically include a neural network model that is trained in advance, according to the weak semi-supervised learning rule based on point labeling, by introducing and using a sample transformation model comprising at least a point amplification module together with a large number of point labeling sample images and a small number of full labeling sample images, through weak semi-supervised learning, and that can automatically perform target object detection on input image data.
The point labeling sample image may be specifically understood as a sample image labeled with a labeling point for indicating a sample object and a type label of the sample object. The above-mentioned full-label sample image is specifically understood as a sample image labeled with a label box for box-selecting a sample object, and a type tag of the sample object. The cost of labeling a fully labeled sample image is typically higher than the cost of labeling a point labeled sample image.
Specifically, the point labeling sample image may specifically include one or more sample objects; and, a corresponding labeling point and type label are labeled for each sample object.
The sample transformation model can be specifically understood as an algorithm model which is constructed and trained according to the weak semi-supervised learning rule based on point labeling and which can, based on an input point labeling sample image, predict and automatically label a bounding box (pseudo labeling frame) capable of framing the sample object.
Further, the sample transformation model comprises at least: a point amplification module. The point amplification module can be specifically used for amplifying the single labeling point that originally corresponds to one sample object in the point labeling sample image into a plurality of labeling points. Furthermore, the sample transformation model can comprehensively utilize the information of the original labeling point and the information of the plurality of amplified pseudo labeling points, mining relatively more comprehensive and richer image features, so that the pseudo labeling frame framing the sample object can be predicted more accurately.
In this embodiment, according to the weak semi-supervised learning rule based on point labeling, by introducing and using the sample transformation model comprising at least the point amplification module, the point labeling sample images with relatively low labeling cost can be automatically transformed into processed point labeling sample images with relatively good effect carrying pseudo labeling frames; the processed point labeling sample images and a small number of full labeling sample images can then be used, through weak semi-supervised learning, to efficiently train a target detection model with relatively high precision and relatively good effect.
The weak semi-supervised learning rule based on point labeling can be specifically understood as a rule set, designed based on the idea of weak semi-supervised learning and the characteristics of point labeling sample images, which specifies how to construct and train the sample transformation model, and how to train the target detection model with the point labeling sample images and the full labeling sample images through the sample transformation model.
In some embodiments, referring to fig. 3, before the target image is acquired, the method may further include the following steps when implemented:
S1: acquiring a point labeling sample image and a full labeling sample image according to a weak semi-supervised learning rule based on point labeling; the number of the point labeling sample images is larger than that of the full labeling sample images; the full labeling sample image at least comprises a labeling frame for selecting a sample object and a type label carried by the labeling frame and aiming at the sample object; the point labeling sample image at least comprises a labeling point used for indicating a sample object and a type label carried by the labeling point and aiming at the sample object;
S2: processing the point labeling sample image by using a sample transformation model to obtain a processed point labeling sample image carrying a pseudo labeling frame;
S3: combining the full labeling sample image and the processed point labeling sample image to obtain a training sample set;
S4: and training the initial detection model by using the training sample set to obtain a target detection model meeting the requirements.
The type tag may specifically include a tag for indicating an object type of the sample object.
In the specific implementation, the sample transformation model can be utilized to process the point labeling sample image so as to automatically transform the point labeling sample image which only originally contains labeling points into a processed point labeling sample image which contains a pseudo labeling frame; and the processed point labeling sample image can be mixed with a small amount of full labeling sample images to obtain a training sample set for training the target detection model. Then, the training sample set can be utilized to obtain a target detection model meeting the requirements by carrying out weak semi-supervised learning on the initial detection model.
The weak semi-supervised learning can be specifically understood as a model training manner of model training learning by using a larger amount of weak annotation data (e.g., point annotation sample images) and a smaller amount of instance-level annotation data (e.g., full annotation sample images).
The initial detection model may specifically include an initial model based on an FCOS structure.
FCOS (Fully Convolutional One-Stage Object Detection) may specifically refer to an anchor-free detection model. Based on this model, the class of each point on the feature map downsampled by a factor of S is predicted first; the size and position of the bounding box are then determined by four values l, r, t, b predicted for the point, which helps NMS suppress low-quality boxes and further improves network performance. Here l, r, t, b represent the distances from the point to the left, right, upper and lower boundaries of the bounding box, respectively.
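As an illustrative sketch only (not taken from the patent; the function name, tensor shapes and stride handling are assumptions made for the example), the FCOS-style box decoding from the four per-point distances l, r, t, b described above can be written as:

import torch

def decode_fcos_boxes(points: torch.Tensor, ltrb: torch.Tensor, stride: int) -> torch.Tensor:
    # points: (N, 2) feature-map coordinates (x, y) of the predicted points;
    # ltrb: (N, 4) predicted distances to the left, top, right and bottom box edges;
    # returns (N, 4) boxes (x1, y1, x2, y2) in input-image coordinates.
    centers = points * stride + stride // 2      # map grid points of the map downsampled by S back to the image
    l, t, r, b = ltrb.unbind(dim=-1)
    x1 = centers[:, 0] - l
    y1 = centers[:, 1] - t
    x2 = centers[:, 0] + r
    y2 = centers[:, 1] + b
    return torch.stack([x1, y1, x2, y2], dim=-1)

Low-quality boxes can then be suppressed by NMS using the predicted class scores, as mentioned above.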
The sample transformation model can be specifically an algorithm model which is constructed in advance according to weak semi-supervised learning rules based on point labeling and is trained by using full-labeling sample images.
In some embodiments, referring to fig. 4, the sample transformation model may specifically further include: a feature extraction network, a Transformer module, a point encoder, a prediction head module and other structures;
wherein the feature extraction network is connected with the Transformer module; the point amplification module is also connected with the point encoder; the point encoder is also connected with the Transformer module; the Transformer module is also connected with the prediction head module;
In specific implementation, the feature extraction network can be specifically used for extracting features of the point labeling sample image input into the sample transformation model to obtain a corresponding sample feature map, and sending the sample feature map to the Transformer module. In addition, the feature extraction network also sends the high-dimensional features of the point labeling sample image, obtained as intermediate results while producing the sample feature map, to the point amplification module.
The point amplification module can be particularly used for receiving the high-dimensional characteristics of the point labeling sample image output by the characteristic extraction network; in the high-dimensional characteristics of the point labeling sample image, acquiring high-dimensional characteristic information at the labeling point through bilinear interpolation sampling; obtaining a position deviation predicted value through position deviation prediction according to the high-dimensional characteristic information at the marked point; according to the coordinates of the original marking points, combining the position deviation predicted values, determining four auxiliary points which are associated with the marking points and positioned in four different directions; and setting the type labels carried by the marking points on the four auxiliary points respectively to obtain the amplified pseudo marking points.
Specifically, referring to fig. 5, the point amplification module may take an existing labeling point as the center and, through position deviation prediction, find four different auxiliary points located to the upper left, upper right, lower left and lower right of the labeling point, where the four auxiliary points and the labeling point indicate the same sample object; the four auxiliary points are used as the amplified pseudo labeling points. In this way, the original one-to-one correspondence of one labeling point to one sample object can be expanded into a many-to-one correspondence of five labeling points (the existing labeling point and the amplified pseudo labeling points) to one sample object. This many-to-one correspondence can then be used in place of the original one-to-one correspondence, so that richer and more diverse labeling point information can be mined and used to predict the bounding box of the sample object accurately.
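The following is an illustrative sketch only (the module interface, feature dimension and coordinate normalization convention are assumptions, not the patent's code) of augmenting one annotated point into four auxiliary pseudo points through bilinear feature sampling and position offset prediction, as described above:

import torch
import torch.nn as nn
import torch.nn.functional as F

class PointAugmentation(nn.Module):
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # Predicts four (dx, dy) offsets, one per auxiliary point
        # (upper-left, upper-right, lower-left, lower-right of the original point).
        self.offset_head = nn.Linear(feat_dim, 8)

    def forward(self, feat_map: torch.Tensor, points: torch.Tensor) -> torch.Tensor:
        # feat_map: (1, C, H, W) high-dimensional features of the point labeling sample image;
        # points: (N, 2) annotated points as (x, y) normalized to [0, 1];
        # returns: (N, 4, 2) amplified pseudo labeling points (normalized coordinates).
        grid = points.view(1, -1, 1, 2) * 2 - 1                       # to [-1, 1] for grid_sample
        sampled = F.grid_sample(feat_map, grid, align_corners=False)  # bilinear interpolation sampling
        sampled = sampled.squeeze(-1).squeeze(0).t()                  # (N, C) features at each point
        offsets = self.offset_head(sampled).view(-1, 4, 2)            # position deviation prediction
        # Each auxiliary point inherits the type label of its original point (handled outside this sketch).
        return points.unsqueeze(1) + offsets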
The point encoder can be specifically used for obtaining and outputting a corresponding joint code, through relevant coding processing, according to the labeling point information in the point labeling sample image and the amplified pseudo labeling point information.
The Transformer module can be specifically used for generating and outputting corresponding intermediate result data, through corresponding processing, according to the input joint code and the sample feature map of the point labeling sample image.
The prediction head module is used for determining and labeling, based on the intermediate result data output by the Transformer module, a pseudo labeling frame for framing a sample object in the point labeling sample image input into the sample transformation model, so as to output a corresponding processed point labeling sample image carrying the pseudo labeling frame.
By introducing and using the attention mechanism, the Transformer module reduces the distance between any two positions in an input data sequence to a constant, so that long-range semantic associations are captured relatively better when analyzing and predicting longer data; in addition, unlike the sequential structure of an RNN, the Transformer module has better parallelism, fits existing GPU frameworks, can be trained in parallel on distributed GPUs, and thus improves model training efficiency.
The feature extraction network may specifically include: residual networks, such as a ResNet-50 backbone. The ResNet-50 backbone has 50 layers in total and can be divided into five stages according to feature scale; each stage uses convolution kernels of different sizes and numbers and outputs feature maps of different scales and channel numbers, and the last stage finally outputs a 2048-channel feature map C5 as the corresponding sample feature map.
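As an illustrative sketch only (an assumed setup using torchvision, not the patent's code), the 2048-channel C5 feature map can be obtained from a ResNet-50 backbone as follows:

import torch
from torchvision.models import resnet50
from torchvision.models.feature_extraction import create_feature_extractor

backbone = resnet50(weights=None)
# "layer4" is the last ResNet-50 stage; its output has 2048 channels (C5).
extractor = create_feature_extractor(backbone, return_nodes={"layer4": "c5"})

image = torch.randn(1, 3, 800, 800)    # a dummy point labeling sample image
c5 = extractor(image)["c5"]            # shape (1, 2048, 25, 25) for an 800x800 input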
The prediction head module may specifically include: a multilayer perceptron (Multilayer Perceptron, MLP). Based on the multilayer perceptron, the prediction head module can predict the bounding box containing the sample object by utilizing the intermediate result data output by the Transformer module, so as to determine and label the corresponding pseudo labeling frame.
In some embodiments, the point encoder may specifically include: an improved point encoder; wherein the improved point encoder comprises at least: an absolute position branch processing structure, a relative position branch processing structure, a category index branch processing structure, and a semantic alignment branch processing structure;
the point encoder is used for processing the labeling points in the point labeling sample image and the amplified pseudo labeling points to obtain and output a corresponding joint code.
Specifically, referring to fig. 6, based on the improved point encoder, by additionally introducing and utilizing the relative position branch processing structure and the semantic alignment branch processing structure, codes based on different dimensions can be fully utilized to obtain a joint code with higher fusion and richer information, and object query data with relatively better effect can then be generated based on the joint code. The object query data will be described in detail later. Here, x and y respectively represent the abscissa and the ordinate of a labeling point (including the original labeling points and the amplified pseudo labeling points); c represents the type label of the labeling point; w and h represent the width and height of the point labeling sample image, respectively.
In some embodiments, based on the improved point encoder, referring to fig. 7, in implementation, the point encoder processes the labeling points in the sample image and the amplified pseudo labeling points to obtain and output corresponding joint codes in the following manner:
S1: respectively carrying out preset coding processing according to the position coordinates of the labeling points and the position coordinates of the amplified pseudo labeling points through the absolute position branch processing structure to obtain corresponding absolute position codes;
S2: determining, through the relative position branch processing structure, according to the position coordinates of the labeling points and the position coordinates of the amplified pseudo labeling points, the offset distances of each labeling point and each amplified pseudo labeling point relative to the upper, lower, left and right boundaries of the point labeling sample image, and generating corresponding relative position vectors; converting the relative position vectors into corresponding relative ratio data; and carrying out preset coding processing on the relative ratio data to obtain corresponding relative position codes;
S3: processing the type label of the labeling point and the type labels of the amplified pseudo labeling points through the category index branch processing structure to obtain a corresponding category index code;
S4: respectively carrying out linear mapping on the obtained high-dimensional features of the labeling points and the high-dimensional features of the amplified pseudo labeling points through the semantic alignment branch processing structure to obtain corresponding image semantic feature codes; the image semantic feature codes, the absolute position codes, the relative position codes and the category index codes have the same dimension;
S5: and combining the absolute position code, the relative position code, the category index code and the image semantic feature code to obtain the corresponding joint code.
The preset encoding process may specifically refer to sin encoding.
In the specific implementation, when generating the absolute position code, the position coordinates of the labeling point and the amplified pseudo labeling point for the same sample object can be respectively subjected to sin coding through an absolute position branch processing structure, so that the 256-dimensional absolute position code corresponding to each labeling point is obtained.
When the relative position code is generated, according to the position coordinates of the labeling point and of the amplified pseudo labeling points for the same sample object, the offset distance to the upper boundary (top, abbreviated as t), the offset distance to the lower boundary (down, abbreviated as d), the offset distance to the left boundary (left, abbreviated as l) and the offset distance to the right boundary (right, abbreviated as r) of the point labeling sample image can be calculated for each labeling point through the relative position branch processing structure; relative position vectors corresponding to the labeling points are generated based on these offset distances and recorded as: (top, down, left, right). Based on the relative position vectors, the ratio of the offset distance to the upper boundary to the offset distance to the lower boundary, and the ratio of the offset distance to the left boundary to the offset distance to the right boundary, are calculated; the two ratios are used to construct the relative ratio data corresponding to the labeling points, recorded as: (top/down, left/right). Finally, sin coding is performed on the relative ratio data to obtain the corresponding 256-dimensional relative position code.
When the class index codes are generated, the class index branch processing structure is used for processing the labeling points aiming at the same sample object and the type labels of the amplified pseudo labeling points to obtain 256-dimension class index codes corresponding to the labeling points.
When the image semantic feature code is generated, the high-dimensional features of the labeling point for the same sample object and of the amplified pseudo labeling points are extracted from the point labeling sample image through the semantic alignment branch processing structure, and after interpolation sampling, linear mapping is performed to obtain the 256-dimensional image semantic feature code corresponding to each labeling point.
Finally, the absolute position coding, the relative position coding, the category index coding and the image semantic feature coding are used in combination, so that the information contained in the different codes is fully fused, and the required joint codes are obtained.
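The following is an illustrative sketch only (the encoding dimension split, helper names and the assumption that points lie strictly inside the image are all assumptions for the example) of how the relative position code described above can be computed: boundary offsets, then (top/down, left/right) ratios, then sinusoidal encoding to 256 dimensions.

import torch

def sine_encode(values: torch.Tensor, num_feats: int = 128, temperature: float = 10000.0) -> torch.Tensor:
    # Sinusoidal (sin) encoding: each scalar in `values` of shape (..., K) becomes
    # `num_feats` features, giving an output of shape (..., K * num_feats).
    dim_t = torch.arange(num_feats, dtype=torch.float32)
    dim_t = temperature ** (2 * (dim_t // 2) / num_feats)
    enc = values.unsqueeze(-1) / dim_t                                        # (..., K, num_feats)
    enc = torch.stack((enc[..., 0::2].sin(), enc[..., 1::2].cos()), dim=-1)
    return enc.flatten(-3)                                                    # (..., K * num_feats)

def relative_position_code(points: torch.Tensor, img_w: float, img_h: float) -> torch.Tensor:
    # points: (N, 2) absolute (x, y) coordinates of labeling / pseudo labeling points,
    # assumed to lie strictly inside the image; returns an (N, 256) relative position code.
    x, y = points[:, 0], points[:, 1]
    top, down = y, img_h - y                  # offset distances to the upper / lower boundaries
    left, right = x, img_w - x                # offset distances to the left / right boundaries
    ratios = torch.stack((top / down, left / right), dim=-1)    # relative ratio data (top/down, left/right)
    return sine_encode(ratios, num_feats=128)                   # 2 * 128 = 256 dimensions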
In some embodiments, referring to fig. 6, the above combination uses the absolute position code, the relative position code, the category index code, and the image semantic feature code to obtain a corresponding joint code, which may include the following when implemented:
S1: performing addition operation by using the absolute position code, the relative position code and the category index code to obtain an intermediate code;
S2: and performing a multiplication operation by using the intermediate code and the image semantic feature code to obtain the joint code.
Based on this embodiment, the codes obtained at the different levels can be effectively and comprehensively utilized, and a joint code with relatively good effect can be obtained.
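A minimal sketch of this combination (shapes assumed; not the patent's code) is:

import torch

def joint_code(abs_pos: torch.Tensor, rel_pos: torch.Tensor,
               cat_idx: torch.Tensor, sem_feat: torch.Tensor) -> torch.Tensor:
    # All inputs share the same shape, e.g. (num_points, 256).
    intermediate = abs_pos + rel_pos + cat_idx   # addition of the three codes -> intermediate code
    return intermediate * sem_feat               # multiplication with the image semantic feature code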
In some embodiments, referring to fig. 8, the Transformer module may specifically include: a Transformer Encoder and a Transformer Decoder; wherein the Transformer encoder is respectively connected with the feature extraction network and the Transformer decoder; the Transformer decoder is also connected with the point encoder and the prediction head module;
the Transformer module is used for generating and outputting corresponding intermediate result data according to the input sample feature map, position code and joint code.
The position code may specifically be a positional encoding obtained based on the point labeling sample image.
In some embodiments, referring to fig. 9 in conjunction with fig. 8, in implementation, the Transformer module may generate and output the corresponding intermediate result data according to the input sample feature map, the position code and the joint code in the following manner:
S1: processing the combination of the sample feature map and the position code by the Transformer encoder to generate corresponding embedded data (e.g., embeddings); and transmitting the embedded data to the Transformer decoder;
S2: obtaining a matched attention mask (e.g., attention mask) by the Transformer decoder;
S3: generating corresponding object query data (e.g., object query) by the Transformer decoder using the joint code, and generating and outputting corresponding intermediate result data using the attention mask, the object query data and the embedded data.
The object query data (object query) may be specifically understood as a query object, which is a way of describing the target to be detected in a detection task. Specifically, as an important structure in the Transformer, the object query data allows the detection task to be converted into a similarity calculation between the prediction result and the feature map. In DETR, each object query can be regarded as a representation of one target, typically generated by the Transformer decoder, and its vector representation can be regarded as an encoding of the class and position of the target. During prediction, each object query is matched against the embedded data (for example, encoder embeddings) output by the Transformer encoder, so as to determine which position in the embedded data the object query should be associated with; in this way, an accurate description of the target position and category can be obtained, and target detection is realized.
The attention mask may specifically refer to fig. 10. By using an attention mask with the structure described above, interaction between the object query data of different labeling points for the same sample object can be enhanced; meanwhile, interaction between the object query data of labeling points for different sample objects is avoided, so that the model can jointly focus on the query data of the several labeling points for the same sample object, making full use of the one-to-many correspondence mined by the point amplification module, and thus find the bounding box framing the corresponding sample object more accurately and quickly.
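As an illustrative sketch only (the grouping convention and the PyTorch boolean-mask convention where True marks blocked positions are assumptions for the example), an attention mask of this kind, which lets queries from annotation points of the same sample object interact while blocking interaction across different sample objects, can be built as follows:

import torch

def build_group_attention_mask(object_ids: torch.Tensor) -> torch.Tensor:
    # object_ids: (num_queries,) index of the sample object that each query
    # (original labeling point or amplified pseudo labeling point) belongs to.
    # Returns a boolean (num_queries, num_queries) mask; True = attention blocked.
    same_object = object_ids.unsqueeze(0) == object_ids.unsqueeze(1)
    return ~same_object

# Example: 2 sample objects, each with 1 original point + 4 pseudo points = 5 queries.
ids = torch.tensor([0] * 5 + [1] * 5)
mask = build_group_attention_mask(ids)   # 10 x 10 block-diagonal structure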
In some embodiments, referring to fig. 11, the Transformer decoder may specifically include: a decoding layer; wherein the decoding layer includes: a self-attention layer, a cross-attention layer, a feedforward neural network layer, and the like; in addition, the decoding layer may further include structures such as intermediate layers (e.g., add & norm). Here box indicates the intermediate result data for predicting the bounding box and class indicates the intermediate result data for predicting the type label. In actual use, only the intermediate result data comprising the boxes is output and used. The input decoder embeddings (tgt) can be understood as a copy of the object query data (object query). The positional embeddings may specifically be position features extracted based on the embedded data output by the Transformer encoder.
The self-attention layer (e.g., self-attention) may be particularly useful for processing the object query data with an attention mask to enhance interactions between object query data indicative of the same sample object;
the cross-attention layer (for example, cross-attention) may be specifically configured to refine the processed object query data output by the self-attention layer by using the embedded data, so as to obtain refined object query data;
the feedforward neural network layer (for example, FFN) may be specifically configured to predict a bounding box containing a sample object in a point labeling sample image according to the refined object query data, so as to obtain corresponding intermediate result data.
Wherein the Transformer decoder may comprise a plurality of decoding layers. The intermediate result data output by the previous decoding layer may be input into the next decoding layer as the object query data received by that layer. The next decoding layer can further process the intermediate result data obtained by the previous decoding layer to obtain relatively more accurate updated intermediate result data, which is then input into the following decoding layer. Finally, intermediate result data with higher precision is output through the Transformer decoder.
The feed forward neural network layer (Feed Forward Neural Network, FFN) is the simplest neural network, with the neurons being arranged in layers, each neuron being connected only to the neurons of the previous layer. Based on the network layer, the output of the previous layer can be received and output to the next layer, and feedback is not generated between layers.
In this embodiment, by constructing the self-attention layer and the cross-attention layer, two different attention mechanisms can be introduced in the Transformer decoder to mix two different embedding sequences (e.g., the embedded data and the object query data). The two embedding sequences typically have the same dimension but may come from different modalities (e.g., text, image, audio). Specifically, one sequence is used as the input Q and defines the length of the output sequence; the other sequence is used to provide the inputs K and V. The inputs to the self-attention mechanism typically come from the same sequence, for processing a single embedding sequence; the inputs to the cross-attention mechanism typically come from different sequences, for asymmetrically combining two different embedding sequences: one sequence provides the query Q input, and the other provides the key K and value V inputs.
Specifically, referring to fig. 11, the Transformer decoder may extract the position coding information p_s of the relevant annotation point from the object query data; the object query data is also input into an FFN layer for prediction to obtain content feature information T. Meanwhile, the object query data is combined with the position coding information p_s to obtain an initial V component, an initial K component and an initial Q component, which are input into the self-attention layer for corresponding processing, and first object query data is output. Then, corresponding operations are performed using the embedded data, the first object query data, the previously obtained position coding information p_s and the content feature information T to obtain a first V component, a first K component and a first Q component, which are input into the cross-attention layer for processing, so that the object query data is refined by means of the embedded data and information-rich second object query data is output. Finally, the second object query data is input into the FFN layer for bounding-box prediction and the FFN layer for object-type prediction respectively for corresponding prediction processing, so as to obtain corresponding bounding-box prediction data and object-type prediction data; the bounding-box prediction data and the object-type prediction data are combined to obtain and output the intermediate result data.
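A minimal sketch of one such decoding layer (self-attention on the object queries under the attention mask, cross-attention against the encoder embeddings, then an FFN and a box head). Layer sizes, normalization placement and the box head are illustrative assumptions; the real decoding layer in fig. 11 additionally combines p_s and T into the cross-attention components.

```python
import torch
import torch.nn as nn

class DecodingLayer(nn.Module):
    def __init__(self, d_model=256, nhead=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))
        self.ffn = nn.Sequential(nn.Linear(d_model, 1024), nn.ReLU(), nn.Linear(1024, d_model))
        self.box_head = nn.Linear(d_model, 4)   # predicts bounding-box values per query

    def forward(self, query, pos_s, memory, attn_mask):
        # self-attention: queries (plus point position code p_s) interact under the mask
        q = k = query + pos_s
        query = self.norm1(query + self.self_attn(q, k, query, attn_mask=attn_mask)[0])
        # cross-attention: refine the queries against the encoder embeddings
        query = self.norm2(query + self.cross_attn(query + pos_s, memory, memory)[0])
        # FFN refinement, then bounding-box prediction
        query = self.norm3(query + self.ffn(query))
        return query, self.box_head(query).sigmoid()
```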
In some embodiments, referring to fig. 12, the sample transformation model may further have a matching loss module connected thereto; the matching loss module is configured with a matching loss function and a denoising loss function. Specifically, the above matching loss module may be connected behind the prediction head module. Based on the above matching loss module, an additional label denoising task is introduced and combined with the matching optimization of the model, so that feature information related to label denoising can be used as an aid to more quickly train a sample transformation model that can automatically output bounding boxes and meets the requirements.
In some embodiments, the Transformer decoder generates the corresponding object query data by using the joint encoding; when implemented, the method may include the following:
s1: acquiring noise data (or denoising part) based on the type tag;
s2: generating corresponding matching data (or matching part) according to the object query data;
s3: combining the noise data and the matching data to obtain the object query data;
accordingly, the matched attention mask may specifically include: a denoising mask region corresponding to the noise data, and a matching mask region corresponding to the matching data.
The noise data may specifically be a noise label (for example, noise label) obtained by performing modes such as label flipping and manual noise adding on a type label of the labeling point.
Specifically, based on the above concept, the original attention mask can be modified by combining the data structure of the noise data and the matching data in the object query data, and the modified attention mask is used as the matched attention mask, as shown in fig. 13. The modified attention mask includes not only a matching mask region but also a denoising mask region. Based on an attention mask of this structure, referring to the cross mark in the Transformer decoder in fig. 12, besides enhancing interaction between object query data indicating the same sample object, queries initiated by the noise data towards the matching data can be effectively prevented, so that the model does not learn type label prediction from leakage of the matching data into the noise data during training; in addition, interaction between noise data for different sample objects, and the corresponding interaction between matching data for different sample objects, can also be effectively prevented. The effectiveness of the model training process can thereby be effectively ensured.
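A minimal, illustrative sketch of such a matched attention mask with a denoising region and a matching region, assuming the denoising queries occupy the first rows/columns. The exact layout of fig. 13 may differ; this only mirrors the visibility constraints described in the text (noise queries cannot see the matching part, denoising groups are isolated from each other, matching queries of the same object interact).

```python
import torch

def build_dn_attention_mask(num_dn_groups, dn_group_size, num_objects, points_per_object=5):
    n_dn = num_dn_groups * dn_group_size
    n_match = num_objects * points_per_object
    q = n_dn + n_match
    neg = float("-inf")
    mask = torch.zeros(q, q)
    # denoising queries must not attend to the matching part (avoid label leakage)
    mask[:n_dn, n_dn:] = neg
    # different denoising groups must not attend to each other
    for g in range(num_dn_groups):
        s, e = g * dn_group_size, (g + 1) * dn_group_size
        mask[s:e, :n_dn] = neg
        mask[s:e, s:e] = 0.0
    # matching part: only queries of the same sample object interact
    mask[n_dn:, n_dn:] = neg
    for i in range(num_objects):
        s = n_dn + i * points_per_object
        mask[s:s + points_per_object, s:s + points_per_object] = 0.0
    return mask
```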
In some embodiments, referring to fig. 14, the sample transformation model may be specifically trained as follows:
S1: acquiring and utilizing a full-labeling sample image according to the weak semi-supervised learning rule based on point labeling, and generating first sample data;
S2: constructing an initial transformation model; wherein the initial transformation model comprises at least an initial point amplification module;
S3: training the initial transformation model by using the first sample data to obtain a sample transformation model meeting the requirements.
When the first sample data is specifically generated, one point can be selected from a labeling frame in the full-labeling sample image to serve as a labeling point, and an object type label indicating a sample object is marked on the labeling point; and simultaneously, marking an original labeling frame in the full labeling sample image as a real result for evaluating a model prediction result, and obtaining corresponding first sample data.
When the initial transformation model is specifically constructed, the constructed initial transformation model further comprises, in addition to the initial point amplification module, an initial feature extraction network, an initial Transformer module, an initial point encoder, an initial prediction head module and a matching loss module.
Specifically, when the first sample data is used to train the initial transformation model, in each round of training the first sample data can be processed by the initial transformation model to obtain a corresponding prediction result; the matching loss module performs matching according to the prediction result and the real result of the first sample data, and calculates the loss cost (including the reconstruction loss related to type label prediction and the matching loss related to bounding-box prediction) based on the related loss functions and the matching result; the model parameters are then optimized and adjusted in a targeted manner, in the direction of minimum loss cost, according to the loss cost. Through multiple rounds of training and adjustment, the accuracy of the model is continuously improved, so that a sample transformation model meeting the requirements is finally obtained.
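A minimal training-loop sketch for this procedure under the assumptions above; the model, matching-loss module and data-loader interfaces are illustrative, not the patented implementation.

```python
import torch

def train_teacher(model, matching_loss_module, loader, epochs=108, lr=1e-4):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=1e-4)
    for epoch in range(epochs):
        for images, point_labels, gt_boxes in loader:   # first sample data
            preds = model(images, point_labels)         # predicted boxes per query
            # matching + loss cost (label reconstruction loss and box matching loss)
            loss = matching_loss_module(preds, gt_boxes)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                            # adjust towards minimum loss cost
```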
In some embodiments, the matching loss module includes: a Hungarian matching loss module based on the Hungarian algorithm.
The Hungarian algorithm is a matching algorithm, i.e. an algorithm for searching for a maximum matching in graph theory, and is mostly used for solving bipartite graph matching problems.
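A small illustration of Hungarian matching between predicted and ground-truth boxes using scipy's linear_sum_assignment on a cost matrix; the pairwise L1 box cost used here is a hypothetical simplification of the cost rule described in the document.

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_boxes: torch.Tensor, gt_boxes: torch.Tensor):
    # pred_boxes: (P, 4), gt_boxes: (G, 4); cost = pairwise L1 distance
    cost = torch.cdist(pred_boxes, gt_boxes, p=1)              # (P, G)
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return pred_idx, gt_idx                                    # minimum-cost assignment
```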
In some embodiments, after determining the location information and the object type of the target object in the target image, the method may further include the following when implemented: automatically framing the target object in the target image with a bounding box according to the position information and the object type of the target object; and marking the object type of the target object near the bounding box to obtain a processed target image, which is then displayed to a worker.
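An illustrative post-processing sketch for this step: draw the bounding box and write the object type next to it before displaying the image to the operator. OpenCV is assumed here purely as an example; the patent does not prescribe a drawing library.

```python
import cv2

def render_detection(image, box, object_type):
    x1, y1, x2, y2 = map(int, box)
    # frame the target object with a bounding box
    cv2.rectangle(image, (x1, y1), (x2, y2), color=(0, 255, 0), thickness=2)
    # mark the object type near the bounding box
    cv2.putText(image, object_type, (x1, max(y1 - 5, 0)),
                cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    return image
```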
In addition, during implementation, whether safety risks exist or not can be detected according to the position information and the object type of the target object in the target image; and under the condition that the safety risk is determined to exist, triggering and carrying out corresponding alarm prompt so as to prompt the staff to timely eliminate the related safety risk.
From the above, according to the target detection method provided by the embodiment of the present disclosure, before implementation, the point-labeled sample image, the full-labeled sample image, and the sample transformation model at least including the point amplification module may be utilized according to the weak semi-supervised learning rule based on point labeling, so as to train the target detection model with higher accuracy and better effect at a lower cost; in the specific implementation, whether a target object exists in the target image can be accurately determined by processing the target image by utilizing the target detection model; and further determining the position information and the object type in the target image under the condition that the target object exists in the target image. According to the weak semi-supervised learning rule based on point labeling, a sample transformation model at least comprising a point amplification module is adopted, only a small number of full-labeling sample images are needed to be acquired and used, the point labeling sample images with low labeling cost are mainly relied on and used, and then the target detection model with high precision and good effect can be quickly trained and obtained, so that the overall processing cost can be effectively reduced, and the target detection model obtained based on the training in the mode is utilized to efficiently and accurately realize target detection of image data.
The specification also provides a training method of the target detection model, which comprises the following steps:
s1: acquiring a point labeling sample image and a full labeling sample image according to a weak semi-supervised learning rule based on point labeling; the number of the point labeling sample images is larger than that of the full labeling sample images; the full-labeling sample image at least comprises a labeling frame for selecting a sample object and a type label carried by the labeling frame and aiming at the sample object; the point labeling sample image at least comprises a labeling point used for indicating a sample object and a type label carried by the labeling point and aiming at the sample object;
s2: processing the point labeling sample image by using a sample transformation model to obtain a processed point labeling sample image carrying a pseudo labeling frame;
s3: combining the full-label sample image and the processed point label sample image to obtain a training sample set;
s4: and training the initial detection model by using the training sample set to obtain a target detection model meeting the requirements.
Based on the embodiment, a large number of point marked sample images, a small number of full marked sample images and a sample transformation model at least comprising a point amplification module can be utilized according to the weak semi-supervised learning rule based on point marking, and a target detection model with higher precision can be obtained by training with lower cost and high efficiency through weak semi-supervised learning.
The specification also provides a training method of the sample transformation model, which comprises the following steps:
s1: acquiring and utilizing a full-label sample image according to a weak semi-supervised learning rule based on point labeling, and generating first sample data;
S2: constructing an initial transformation model; wherein the initial transformation model comprises at least an initial point amplification module;
S3: training the initial transformation model by using the first sample data to obtain a sample transformation model meeting the requirements.
Based on the embodiment, the sample transformation model with good effect can be obtained through training based on weak semi-supervised learning of point labeling.
The embodiment of the present disclosure further provides a server, as shown in fig. 15, where the server includes a network communication port 1501, a processor 1502 and a memory 1503, where the foregoing structures are connected by an internal cable, so that each structure may perform specific data interaction.
The network communication port 1501 may be specifically configured to acquire a target image.
The processor 1502 may be specifically configured to process the target image by using a target detection model to obtain a corresponding target processing result; the target detection model is obtained in advance by weak semi-supervised learning training using point-labeled sample images, full-labeled sample images and a sample transformation model according to the weak semi-supervised learning rule based on point labeling; the sample transformation model at least comprises: a point amplification module; determining whether a target object exists in the target image according to the target processing result; and determining the position information and the object type of the target object in the target image in the case that the target object exists in the target image.
The memory 1503 may be used for storing a corresponding program of instructions.
In this embodiment, the network communication port 1501 may be a virtual port that binds with different communication protocols, so that different data may be transmitted or received. For example, the network communication port may be a port responsible for performing web data communication, a port responsible for performing FTP data communication, or a port responsible for performing mail data communication. The network communication port may also be an entity's communication interface or a communication chip. For example, it may be a wireless mobile network communication chip, such as GSM, CDMA, etc.; it may also be a Wifi chip; it may also be a bluetooth chip.
In this embodiment, the processor 1502 may be implemented in any suitable manner. For example, the processor may take the form of, for example, a microprocessor or processor, and a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro) processor, logic gates, switches, an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a programmable logic controller, and an embedded microcontroller, among others. The description is not intended to be limiting.
In this embodiment, the memory 1503 may include a plurality of layers, and in a digital system, it may be a memory as long as it can hold binary data; in an integrated circuit, a circuit with a memory function without a physical form is also called a memory, such as a RAM, a FIFO, etc.; in the system, the storage device in physical form is also called a memory, such as a memory bank, a TF card, and the like.
The embodiments of the present specification also provide a computer readable storage medium based on the above target detection method, the computer readable storage medium storing computer program instructions that, when executed, implement: acquiring a target image; processing the target image by using a target detection model to obtain a corresponding target processing result; the target detection model is obtained in advance by weak semi-supervised learning training using point-labeled sample images, full-labeled sample images and a sample transformation model according to the weak semi-supervised learning rule based on point labeling; the sample transformation model at least comprises: a point amplification module; determining whether a target object exists in the target image according to the target processing result; and determining the position information and the object type of the target object in the target image in the case that the target object exists in the target image.
In the present embodiment, the storage medium includes, but is not limited to, a random access Memory (Random Access Memory, RAM), a Read-Only Memory (ROM), a Cache (Cache), a Hard Disk (HDD), or a Memory Card (Memory Card). The memory may be used to store computer program instructions. The network communication unit may be an interface for performing network connection communication, which is set in accordance with a standard prescribed by a communication protocol.
In this embodiment, the functions and effects of the program instructions stored in the computer readable storage medium may be explained in comparison with other embodiments, and are not described herein.
Referring to fig. 16, on a software level, the embodiment of the present disclosure further provides an object detection apparatus, which may specifically include the following structural modules:
an acquisition module 1601, specifically configured to acquire a target image;
the processing module 1602 may be specifically configured to process the target image by using a target detection model to obtain a corresponding target processing result; the target detection model is obtained in advance by weak semi-supervised learning training using point-labeled sample images, full-labeled sample images and a sample transformation model according to the weak semi-supervised learning rule based on point labeling; the sample transformation model at least comprises: a point amplification module;
The determining module 1603 may be specifically configured to determine whether a target object exists in the target image according to the target processing result; and determining the position information and the object type of the target object in the target image under the condition that the target object exists in the target image.
In some embodiments, the apparatus may be further configured to, prior to acquiring the target image: acquiring a point labeling sample image and a full labeling sample image according to a weak semi-supervised learning rule based on point labeling; the number of the point labeling sample images is larger than that of the full labeling sample images; the full-labeling sample image at least comprises a labeling frame for selecting a sample object and a type label carried by the labeling frame and aiming at the sample object; the point labeling sample image at least comprises a labeling point used for indicating a sample object and a type label carried by the labeling point and aiming at the sample object; processing the point labeling sample image by using a sample transformation model to obtain a processed point labeling sample image carrying a pseudo labeling frame; combining the full-label sample image and the processed point label sample image to obtain a training sample set; and training the initial detection model by using the training sample set to obtain a target detection model meeting the requirements.
In some embodiments, the sample transformation model may specifically further include: a feature extraction network, a Transformer module, a point encoder, a prediction head module, etc.;
wherein the feature extraction network is connected with the Transformer module; the point amplification module is also connected with the point encoder; the point encoder is also connected with the Transformer module; the Transformer module is also connected with the prediction head module;
the prediction head module is used for determining and marking, based on the intermediate result data output by the Transformer module, a pseudo labeling frame framing the sample object in the point labeling sample image input into the sample transformation model, so as to output a corresponding processed point labeling sample image carrying the pseudo labeling frame.
In some embodiments, the feature extraction network is used for extracting features of the point labeling sample image input into the sample transformation model to obtain a corresponding sample feature map; the sample feature map is sent to the Transformer module;
the point amplification module is used for receiving the high-dimensional characteristics of the point labeling sample image output by the characteristic extraction network; in the high-dimensional characteristics of the point labeling sample image, acquiring high-dimensional characteristic information at the labeling point through bilinear interpolation sampling; according to the high-dimensional characteristic information at the marking point, determining four auxiliary points which are associated with the marking point and positioned in four different directions through position deviation prediction; and setting the type labels carried by the marking points on the four auxiliary points respectively to obtain the amplified pseudo marking points.
In some embodiments, the point encoder may include: an improved point encoder; wherein the improved point encoder comprises at least: an absolute position branch processing structure, a relative position branch processing structure, a category index branch processing structure, and a semantic alignment branch processing structure;
the point encoder is used for processing the labeling points in the point labeling sample image and the amplified pseudo labeling points to obtain and output corresponding joint codes.
In some embodiments, the point encoder may specifically process the labeling points in the point labeling sample image and the amplified pseudo labeling points to obtain and output a corresponding joint code in the following manner: performing preset coding processing on the position coordinates of the labeling points and the position coordinates of the amplified pseudo labeling points through the absolute position branch processing structure to obtain corresponding absolute position codes; determining, through the relative position branch processing structure and according to the position coordinates of the labeling points and of the amplified pseudo labeling points, the offset distances of the labeling points and of the amplified pseudo labeling points relative to the upper, lower, left and right boundaries of the point labeling sample image, and generating corresponding relative position vectors; converting the relative position vectors into corresponding relative ratio data; performing preset coding processing on the relative ratio data to obtain corresponding relative position codes; processing the type labels of the labeling points and the type labels of the amplified pseudo labeling points through the category index branch processing structure to obtain corresponding category index codes; performing linear mapping on the obtained high-dimensional features of the labeling points and the high-dimensional features of the amplified pseudo labeling points through the semantic alignment branch processing structure to obtain corresponding image semantic feature codes; the image semantic feature codes, the absolute position codes, the relative position codes and the category index codes have the same dimension; and combining the absolute position codes, the relative position codes, the category index codes and the image semantic feature codes to obtain corresponding joint codes.
In some embodiments, the apparatus may combine the absolute position code, the relative position code, the category index code and the image semantic feature code to obtain the corresponding joint code in the following manner: performing an addition operation on the absolute position code, the relative position code and the category index code to obtain an intermediate code; and performing a multiplication operation on the intermediate code and the image semantic feature code to obtain the joint code.
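A minimal sketch of this combination step, assuming the four branch outputs already have equal dimensions (e.g. 256); the function and tensor shapes are illustrative.

```python
import torch

def combine_joint_code(abs_pos_code, rel_pos_code, cls_index_code, semantic_code):
    intermediate = abs_pos_code + rel_pos_code + cls_index_code   # addition step
    return intermediate * semantic_code                            # multiplication step

# e.g. four (num_points, 256) tensors -> one (num_points, 256) joint code
codes = [torch.randn(5, 256) for _ in range(4)]
joint_code = combine_joint_code(*codes)
```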
In some embodiments, the Transformer module may specifically include: a Transformer encoder, a Transformer decoder, etc.; wherein the Transformer encoder is respectively connected with the feature extraction network and the Transformer decoder; the Transformer decoder is also connected with the point encoder and the prediction head module;
the Transformer module is used for generating and outputting corresponding intermediate result data according to the input sample feature map, the position code and the joint code.
In some embodiments, the Transformer module may specifically generate and output corresponding intermediate result data according to the input sample feature map, the position code and the joint code in the following manner: processing the combination of the sample feature map and the position code by the Transformer encoder to generate corresponding embedded data, and transmitting the embedded data to the Transformer decoder; obtaining a matched attention mask by the Transformer decoder; generating corresponding object query data by the Transformer decoder using the joint encoding; and generating and outputting corresponding intermediate result data by using the attention mask, the object query data and the embedded data.
In some embodiments, the Transformer decoder may specifically include: a decoding layer; wherein the decoding layer includes: a self-attention layer, a cross-attention layer and a feedforward neural network layer;
the self-attention layer is used for processing the object query data by using an attention mask so as to strengthen interaction between the object query data indicating the same sample object;
the cross attention layer is used for refining the processed object query data output by the self attention layer by utilizing the embedded data so as to obtain refined object query data;
and the feedforward neural network layer is used for predicting a bounding box containing the sample object in the point labeling sample image according to the refined object query data to obtain corresponding intermediate result data.
In some embodiments, the sample transformation model may specifically further have a matching loss module connected thereto; the matching loss module is configured with a matching loss function and a denoising loss function.
In some embodiments, the apparatus, when embodied, may generate the corresponding object query data by the Transformer decoder using the joint encoding in the following manner: acquiring noise data based on the type labels; generating corresponding matching data according to the object query data; and combining the noise data and the matching data to obtain the object query data;
Accordingly, the matched attention mask may specifically include: a denoising mask region corresponding to the noise data, and a matching mask region corresponding to the matching data.
In some embodiments, the sample transformation model may be specifically trained as follows: acquiring and utilizing a full-labeling sample image according to the weak semi-supervised learning rule based on point labeling, and generating first sample data; constructing an initial transformation model, wherein the initial transformation model comprises at least an initial point amplification module; and training the initial transformation model by using the first sample data to obtain a sample transformation model meeting the requirements.
It should be noted that, the units, devices, or modules described in the above embodiments may be implemented by a computer chip or entity, or may be implemented by a product having a certain function. For convenience of description, the above devices are described as being functionally divided into various modules, respectively. Of course, when the present description is implemented, the functions of each module may be implemented in the same piece or pieces of software and/or hardware, or a module that implements the same function may be implemented by a plurality of sub-modules or a combination of sub-units, or the like. The above-described apparatus embodiments are merely illustrative, for example, the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
From the above, according to the target detection device provided by the embodiment of the present disclosure, by adopting the sample transformation model including at least the point amplification module according to the weak semi-supervised learning rule based on the point labeling, only a small number of full-labeling sample images need to be acquired and used, and the target detection model with higher precision and better effect can be quickly trained and obtained by mainly relying on and using the point labeling sample images with lower labeling cost, so that the overall processing cost can be effectively reduced, and the target detection for the image data can be efficiently and accurately realized by using the target detection model obtained by training based on the above manner.
In a specific scenario example, the related ideas of the target detection method provided in the present specification may be applied to implement specific weak semi-supervised target detection.
In this scenario example, a Super Point-DETR (SP-DETR) is proposed on the basis of Point-DETR. A point amplification module is introduced into the model (for example, the sample transformation model) to predict a plurality of labeling points for the same object, realizing a one-to-many relationship between objects and points and reducing the detection difficulty. Relative position and semantic alignment branches (for example, the relative position branch processing structure and the semantic alignment branch processing structure in the improved point encoder) are also incorporated to further mine the labeling information, enhancing the encoding capability of the point encoder. Point codes are used to optimize the object query generation of the Decoder and the cross-attention matching flow, accelerating model convergence. Finally, an additional label denoising task is introduced to keep Hungarian matching stable, so as to accelerate model convergence and improve model training efficiency.
Specifically, a weak semi-supervised target detection method based on point labeling is provided. A teacher model (e.g., the sample transformation model) that maps a point to a bounding box is trained; in addition to the basic components of DETR, the model comprises a point amplification module, a point encoder, a Decoder optimization module and an additional label denoising task. The point amplification module amplifies the point labeling information to reduce the detection difficulty; the stronger point encoder encodes the point labeling information to generate object queries with more guiding capability; drawing on the idea of optimizing the Decoder with reference point codes (using the Transformer decoder in the Transformer module), the generation of object queries and the cross-attention matching are optimized to accelerate the convergence of the Transformer; and a denoising task is introduced to ensure the stability of the Hungarian matching algorithm and accelerate model convergence. Further, in this scenario example, because the point amplification module and the label denoising task are introduced, a special attention mask is also designed to control the interaction of the object queries.
In specific implementation, referring to fig. 12, a point labeling-based weak semi-supervised target detection method is proposed, where the whole process involves a teacher model and a student model (e.g., a target detection model).
As the teacher model, it includes: a feature extraction module (e.g., the feature extraction network), a point amplification module, a point encoder, a Transformer module, a prediction head module and a Hungarian matching module (e.g., the matching loss module).
The teacher model is specifically used for training an encoder from a point to a boundary box based on the complete annotation data, and generating complete annotations for most point annotation data.
The student model is trained (resulting in a final target detection model) using the complete annotation data and the generated pseudo annotation data, and the student model defaults to the FCOS model.
The feature extraction module of the teacher model extracts features of the input image with a ResNet50 backbone network for subsequent calculation. The point amplification module of the teacher model predicts the coordinates of surrounding points from the high-dimensional features at the labeled points, realizing the amplification of the point labels and reducing the detection difficulty, and introduces an attention mask for controlling the interaction among the multiple object queries of the same object. The point encoder of the teacher model encodes the point labels into 256-dimensional point codes using four calculation branches, which serve as the object query input of the subsequent Transformer Decoder. The Transformer calculation module of the teacher model comprises an Encoder and a Decoder: the encoding stage encodes the input image features, and the decoding stage learns from the encoded features using object queries (a group of learnable position codes in the Transformer); following the idea of optimizing the Decoder with reference point codes, the point codes are used to optimize the object query generation and the cross-attention matching process of the Decoder. The prediction head module of the teacher model uses a multi-layer perceptron (MLP) to predict the relative bounding-box positions from the object queries computed by the Transformer. The Hungarian matching module of the teacher model matches the predicted set with the given real set and assigns them according to the principle of minimum cost, the cost rule being determined by a group of loss functions. An additional label denoising task is introduced to keep the Hungarian algorithm stable; the noisy labels are also used as object queries, the object queries are divided into a matching part and a denoising part, and new constraints are added on the basis of the attention mask introduced by the point amplification module.
When the teacher model is implemented, the teacher model can be constructed and trained according to the following steps.
Step 1: coordinate points are selected from the complete labels and point labels are generated (first sample data is obtained).
In this scenario example, in order to train the encoding of a point to a bounding box by using a small amount of completely labeled data, first, a coordinate point is randomly selected from the completely labeled bounding box, and the up-down, left-right position offset of the point relative to the bounding box is calculated as a target result to be predicted, and point labeling information is generated in combination with category information.
Step 2: extracting features from the input picture using a ResNet50 backbone network.
In this scenario example, the ResNet50 has 50 layers in total and can be divided into five parts according to feature scale; the convolution kernels used in each part differ in size and number, so feature maps of different scales and channel numbers, C2, C3, C4 and C5, are output. SP-DETR is based on the DETR model and uses the 2048-dimensional features of the last layer, C5, for encoding and decoding.
Step 3: amplifying the point labels with the point amplification module and designing a corresponding attention mask.
In general, Point-DETR adopts a one-to-one strategy, i.e. one object uses one point annotation, which the point encoder turns into one object query. In the Transformer, this object query is responsible for detecting the object that the point label is responsible for. Because the one-to-one strategy is used, the sizes of the prediction set and the real set are equal, each prediction result only computes a loss against the real result of its corresponding object, no set matching is needed, and the Hungarian matching module is removed in Point-DETR. However, during training each prediction has to represent its object on its own, and locating the whole object from only one point is difficult, so a longer training period is required. Secondly, the learning of the model is limited to the point labeling positions, and generalization to different point labeling positions is poor. Based on this, SP-DETR proposes a point amplification module, the structure of which is shown in fig. 5.
In this scenario example, the implementation flow of the adopted point amplification module is as follows: for each given annotation point, bilinear interpolation sampling is used to sample from the 2048-channel feature map of the last layer extracted by the backbone network, which is the high-dimensional feature input into the Transformer (the feature input to the Transformer is a single-scale feature, since SP-DETR is based on DETR), so as to acquire the corresponding high-dimensional feature at that coordinate; this feature is then input into a designed position offset prediction branch. Through this prediction branch, the position offsets of 4 points (for example, auxiliary points) around the original point coordinate are predicted, and the actual coordinates of the predicted points are then obtained from the coordinates of the original point. Each of these 5 points represents a point responsible for the same object, changing the original one-to-one strategy into a one-to-many strategy.
For the newly generated point coordinates, corresponding labeling information needs to be supplemented for training, and the category of the newly generated point coordinates is consistent with the category information of the original point. For boundary frame annotation information, because the boundary frame prediction of the SP-DETR adopts the representation of the vertical and horizontal position offset of the current point coordinates relative to the original real boundary frame, the position of the original real boundary frame can be calculated according to the boundary frame information of the original point, and then the relative position of each newly generated point relative to the real boundary frame is calculated as the boundary frame annotation information. This strategy is only adopted in the training phase in this embodiment, since the verification phase lacks information for updating.
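A minimal sketch of this point amplification idea: bilinearly sample the high-dimensional feature at each labeled point and predict offsets for 4 auxiliary points. The offset-head structure, the normalized coordinate convention and the returned shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointAmplification(nn.Module):
    def __init__(self, in_dim=2048):
        super().__init__()
        self.offset_head = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                         nn.Linear(256, 8))        # 4 points x (dx, dy)

    def forward(self, feat, points):
        # feat: (1, C, H, W); points: (N, 2), normalized to [0, 1]
        grid = points.view(1, -1, 1, 2) * 2 - 1                     # to [-1, 1] for grid_sample
        sampled = F.grid_sample(feat, grid, align_corners=False)    # (1, C, N, 1)
        sampled = sampled.squeeze(0).squeeze(-1).t()                # (N, C) features at points
        offsets = self.offset_head(sampled).view(-1, 4, 2)          # normalized offsets
        aux_points = points.unsqueeze(1) + offsets                  # 4 auxiliary points each
        return torch.cat([points.unsqueeze(1), aux_points], dim=1)  # (N, 5, 2): one-to-many
```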
Because a one-to-many strategy is adopted, a Hungarian matching component is still used in SP-DETR, with the bounding-box loss as the matching cost; it is responsible for the optimal matching among the multiple point-labeling prediction results of the same object, reducing the learning difficulty and improving the robustness of the model to the point labeling positions. In addition, an attention mask is required. The attention mask is a matrix used by the self-attention module of the Decoder in the Transformer to control the interactions of the object queries, and its size equals the number of object queries. For the object queries input into the Transformer Decoder, because of the one-to-many (5) strategy, every 5 object queries form a combination responsible for detecting the same object; they need to interact with each other and need not interact with object queries responsible for other objects. A value of 0 within each group in the attention mask indicates that interaction is performed, and a value of minus infinity between different groups indicates that no interaction is performed. The attention mask designed here, A_1, is shown in fig. 10; for visual understanding, 1 and 0 in the figure represent 0 and minus infinity in the actual encoding, respectively.
Step 4: the point encoder encodes point annotation information using four computational branches.
In this scenario example, the point encoder is the core structure of the entire model, which enables mapping from point labels to bounding boxes. In addition to the inclusion of absolute position branches (e.g., absolute position branch processing structures) and randomly initialized learnable class codes (e.g., class index branch processing structures), the addition of relative position coding branches (e.g., relative position branch processing structures) and feature semantic alignment branches (e.g., semantic alignment branch processing structures) to the point encoder in SP-DETR results in an overall structure of the point encoder as shown with reference to fig. 6.
Regarding the relative position coding branch: how to fully utilize the limited labeling information is the starting point of semi-supervised and weakly supervised method design. Besides its own absolute coordinates and category information, a given point also has relative position coordinates with respect to the picture, represented by its offsets to the picture's four boundaries, up, down, left and right. For the object to be detected, its relative position in the picture is also an important factor assisting its detection.
For this purpose, a relative position calculation branch is designed: the (top, down, left, right) vector computed for each point is converted into a (top/down, left/right) vector, which is then encoded with sine encoding to obtain a code representing the relative position of the point. This code is added to the previous codes and used as the subsequent object query (e.g., object query data); a sketch of this branch follows.
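A minimal sketch of the relative position branch: compute each point's offsets to the four image boundaries, form the (top/down, left/right) ratios as described above, and apply sine/cosine positional encoding. The temperature, the encoding layout and the clamping are assumptions chosen for a runnable example.

```python
import torch

def relative_position_code(points, img_w, img_h, d_model=256, temperature=10000.0):
    # points: (N, 2) pixel coordinates (x, y)
    x, y = points[:, 0], points[:, 1]
    top, down, left, right = y, img_h - y, x, img_w - x
    ratios = torch.stack([top / down.clamp(min=1e-6),
                          left / right.clamp(min=1e-6)], dim=-1)            # (N, 2)
    dim_t = temperature ** (torch.arange(d_model // 4, dtype=torch.float32) / (d_model // 4))
    enc = ratios.unsqueeze(-1) / dim_t                                       # (N, 2, d_model // 4)
    enc = torch.cat([enc.sin(), enc.cos()], dim=-1)                          # (N, 2, d_model // 2)
    return enc.flatten(1)                                                    # (N, d_model)
```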
Regarding the feature semantic alignment branch: matching is a common concept in visual tasks, especially in comparison tasks such as face recognition and object tracking. The core idea is to predict the similarity between inputs. Empirical results show that Siamese (twin network) architectures, which project both sides to be matched into the same encoding space, perform well in tasks involving matching. Based on this, the cross-attention of DETR can be interpreted as a "match and feature refinement" process. To achieve fast convergence, semantic alignment between the object query and the image feature encoding needs to be ensured, i.e. both are projected into the same encoding space. The object query code is randomly projected into the encoding space at initialization, requiring a very long training period to learn a meaningful match with the image features. For this purpose, a feature semantic alignment branch is added: the high-dimensional features at the positions corresponding to the point labels are extracted by bilinear interpolation sampling, linearly mapped into a code consistent with the existing code dimension of the point encoder, and multiplied with the existing code to obtain the new point code, thereby aligning the features with the object query semantics.
Step 5: the object query and Decoder matching is optimized based on point coding.
In this scenario example, the Transformer (e.g., the Transformer module) includes an Encoder and a Decoder. The output of the Encoder is a feature of the same resolution as the input, and the keys and queries are pixels from this feature. The Decoder consists of a set of decoder layers (e.g., the decoding layers), each of which in turn consists of three main parts: (1) a self-attention layer, for interactions between object queries and for removing duplicate predictions; (2) a cross-attention layer, aggregating the embeddings output by the Encoder to refine the object queries and further improve the prediction of class and bounding boxes; (3) an FFN layer (e.g., the feedforward neural network layer).
Specifically, the prediction of the bounding box from the object query can be formulated as shown in equation (1):

box = sigmoid(FFN(f) + [s^T 0 0]^T)    (1)

wherein f represents the object query; the bounding box is a four-dimensional vector [b_cx, b_cy, b_w, b_h]^T, whose components represent the center coordinates of the bounding box and its width and height respectively; sigmoid is used to normalize the bounding box into the range [0, 1]; FFN is used to predict the non-normalized bounding box; s is the coordinate of a non-normalized reference point, which is (0, 0) in DETR and is obtained by feature prediction in SP-DETR.
In a specific implementation, the matching process of the DETR Decoder can be described as follows: the cross-attention of the Decoder has three inputs, queries, keys and values. The key is obtained by adding a content key c_k (output by the Encoder) and a spatial key p_k; the value, like the content key, is the embedding output by the Encoder; and in DETR each query is obtained by adding a content query c_q output by self-attention and a spatial query p_q (the object query o_q).

The relevant attention weights are calculated based on the dot product of query and key, as shown in equation (2):

(c_q + p_q)^T (c_k + p_k)
  = c_q^T c_k + c_q^T p_k + p_q^T c_k + p_q^T p_k
  = c_q^T c_k + c_q^T p_k + o_q^T c_k + o_q^T p_k    (2)

SP-DETR differs from DETR in that it is not encoded by means of reference points, but with the point code obtained by the point encoder. The idea of the reference point is to position it accurately by continuously adjusting the prediction parameters; the point label, by contrast, provides accurate reference information from the very beginning, so this optimization idea fits the present scenario even better. Drawing on the idea of optimizing with reference point codes, and referring to fig. 11, the point code is used directly as p_s, without going through a learnable code to a two-dimensional coordinate and then to a coordinate code. The generation of T likewise uses FFN prediction from the decoder embedding; however, unlike Conditional-DETR, where the decoder embedding is initialized to an all-zero vector with the same shape as the object query, here it is initialized to the point code, which reduces its learning difficulty and introduces guidance information through the point labels.
Step 6: in order to ensure the matching stability of the Hungary algorithm, an additional tag denoising task is introduced.
In this scenario example, it is considered that the slow convergence of DETR is also caused by the instability of Hungarian matching: due to the nature of stochastic optimization, this component is unstable in the early stage of training, and for the same image the queries often match different objects in different batches, making the optimization ambiguous and unstable. For this purpose, a denoising task is introduced to reduce the instability of Hungarian matching.
In general, in addition to the Hungarian loss, DN-DETR feeds the Transformer Decoder with noised real bounding boxes and trains the model to reconstruct the real bounding boxes, which effectively reduces the difficulty of bipartite graph matching and achieves faster convergence. Because the noised bounding boxes do not need bipartite graph matching, the denoising task can be regarded as a simpler auxiliary task that helps DETR alleviate the unstable discrete bipartite matching and learn bounding-box prediction faster; at the same time, because the added random noise is small, the denoising task also helps to reduce the optimization difficulty. To maximize the potential of the auxiliary task, the Decoder queries are treated as encodings of noised bounding-box/label pairs, and both bounding-box denoising and label denoising are performed. The Decoder queries consist of two parts: a matching part, which is processed in the same way as in DETR, using bipartite graph matching, with the matched Decoder outputs used to learn to predict the real bounding-box/label pairs; and a denoising part, whose input is the noised real bounding-box/label pairs and whose output aims at reconstructing the real objects. For label noise, label flipping is adopted, randomly flipping some labels into other labels. For bounding-box noise, center shifting (adding noise to the bounding-box center coordinates) and bounding-box scaling are used (the bounding-box width and height are randomly sampled from [(1-λ2)w, (1+λ2)w] and [(1-λ2)h, (1+λ2)h] respectively, with λ2 ∈ (0, 1)).
SP-DETR is based on the DETR model, whereas DN-DETR is based on the DAB-DETR model. DN-DETR initializes the object query with a four-dimensional learnable code (x, y, w, h) as a learnable bounding-box code; the output of each decoder layer contains (Δx, Δy, Δw, Δh), and the code is updated to (x+Δx, y+Δy, w+Δw, h+Δh) to participate in the computation of the next layer. The decoder embedding is randomly initialized. When noise is added, the noised labels are concatenated into the decoder embedding and the noised bounding boxes are concatenated into the object query for the subsequent calculation.
SP-DETR takes the point code of the point encoder as the object query and does not use a bounding-box code, and the semantics of the point code cannot serve as a bounding-box code, so only the label denoising task is introduced when the denoising task is brought into SP-DETR. Because the point labeling information already contains category information, the loss function of the SP-DETR matching part does not contain a label loss; introducing the label denoising task into SP-DETR therefore assists in improving the accuracy of bounding-box prediction. In the implementation, the decoder embedding is likewise initialized to the point code of the previous step, and the noised labels are concatenated onto it to obtain the new decoder embedding; the noising is again realized by label flipping. The object query is also obtained from the point code; in the Transformer, the object query and the decoder embedding must have identical vector shapes, so in the absence of noised bounding boxes the existing object query is expanded to the shape of the decoder embedding by padding with all-zero vectors, which does not affect the label-denoising part. Accordingly, the auxiliary loss is only the label loss.
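A minimal sketch of building the label-denoising input by label flipping: randomly flip a fraction of the true type labels, embed them, and concatenate the groups. The flip ratio, the embedding table and the shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

def make_denoising_queries(gt_labels, num_classes, label_embed: nn.Embedding,
                           num_groups=5, flip_ratio=0.2):
    groups = []
    for _ in range(num_groups):
        noisy = gt_labels.clone()
        flip = torch.rand_like(noisy, dtype=torch.float) < flip_ratio
        noisy[flip] = torch.randint(0, num_classes, (int(flip.sum()),))  # label flipping
        groups.append(label_embed(noisy))                                # (M, d_model)
    return torch.cat(groups, dim=0)                                      # (P*M, d_model)
```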
Because the denoising task is added, the Decoder object queries contain a matching part (e.g., the matching data) and a denoising part (the noise data), and as shown in fig. 12, the attention mask input to the Decoder needs to be adjusted. The point amplification module has already set an attention mask, but that mask only considers the matching part, controlling the interactions between each group of object queries of the same object. To avoid information leakage, another dimension of control needs to be added: the attention mask is divided into a matching part and a denoising part. The matching part is still divided according to the constraints of the point amplification module. The object queries of the denoising part are visible to the matching part, because the denoising part only contains label information and the matching part also contains label information, so no information leakage is caused; the matching part is invisible to the denoising part, because the real label information contained in the matching part would interfere with the prediction of the denoising part. Within the denoising part, which comprises several denoising groups, the groups do not interact with each other. The newly added mask control A_2 is shown in equation (3), and the final mask A_3 is visualized in fig. 13; for visual understanding, 1 and 0 in the figure represent 0 and minus infinity in equation (3), respectively.
Wherein A_2^{i,j} indicates whether object query q_i interacts with object query q_j; equation (3) only represents the newly added mask control and does not cover the design of the matching part. P represents the number of denoising groups; M represents the number of noised labels per group. For the whole mask, the first P × M rows represent the denoising part and the remaining rows are the matching part; minus infinity indicates no interaction; 0 indicates that interaction is possible.
Step 7: setting and calculating the loss functions.
In this scenario example, during training the Decoder object queries include a denoising part and a matching part because of the additional denoising task, and the resulting overall loss L_SP-DETR is defined as shown in equation (4). The loss of the matching part, L_match (e.g., the matching loss function), is defined as shown in equation (5); since the category information is already contained in the object query, only the bounding box requires a loss calculation, and the bounding-box loss L_box is defined in the same way as in DETR. The bounding-box prediction is computed from s = (x, y, x, y), where (x, y) are the coordinates of the labeled point, plus the predicted relative offsets. The loss of the denoising part is defined as L_denoise (e.g., the denoising loss function), which contains only the label loss; in the specific implementation, Focal Loss is used.
The overall loss L_SP-DETR (including the matching-part loss and the denoising-part loss) is expressed as:

L_SP-DETR = L_match + L_denoise    (4)
wherein the matching-part loss is the bounding-box loss summed over the matched prediction/ground-truth pairs:

L_match = Σ_i L_box(b_σ(i), b̂_i)    (5)

wherein L_box is based on the L1 loss and the GIoU loss, as shown in equation (6):

L_box = λ_iou · L_iou(b_σ(i), b̂_i) + λ_L1 · ||b_σ(i) − b̂_i||_1    (6)

wherein λ_iou is set to 2 in the experiments, λ_L1 is set to 5 in the experiments, and L_iou is the GIoU loss shown in equation (7):

L_iou = 1 − ( |b_σ(i) ∩ b̂_i| / |b_σ(i) ∪ b̂_i| − |B(b_σ(i), b̂_i) \ (b_σ(i) ∪ b̂_i)| / |B(b_σ(i), b̂_i)| )    (7)

wherein |·| represents the area, and the intersection and union of the predicted box coordinates and the real box coordinates are used as shorthand for the boxes themselves; the area of an intersection or union is computed from the min/max of linear functions of b_σ(i) and b̂_i, which keeps the loss function smooth enough to be optimized with stochastic gradient descent; B(b_σ(i), b̂_i) denotes the largest box containing both b_σ(i) and b̂_i, whose area is likewise computed from the min/max of linear functions of the box coordinates.
The expression of the Focal Loss is shown in formula (8):
FL(p_t) = −α_t · (1 − p_t)^γ · log(p_t), (8)
wherein α_t is a balance factor and γ is an adjustment factor. The more categories are involved, the smaller α_t is set; it is set to 0.25 in the experiment. γ is used to adjust the relative weight in the total loss of samples with different confidence, and is set to 2 in the experiment.
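The following is a minimal sketch of the sigmoid form of the Focal Loss in formula (8), with alpha = 0.25 and gamma = 2 as above; the 0/1 target format and the mean reduction are illustrative assumptions. An equivalent off-the-shelf implementation is available as torchvision.ops.sigmoid_focal_loss.

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # Sketch of the Focal Loss in equation (8), in its sigmoid form:
    # FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t), with alpha = 0.25
    # and gamma = 2 as in the experiment. targets are 0/1 (one-hot) labels.
    prob = logits.sigmoid()
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")  # = -log(p_t)
    p_t = prob * targets + (1 - prob) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1.0 - p_t) ** gamma * ce).mean()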
After the teacher model is obtained by training in the above manner, the teacher model may be used to process the point labeling data (e.g., the point labeling sample images) to obtain the corresponding pseudo labeling data (e.g., the processed point labeling sample images carrying pseudo labeling frames). The FCOS model is then trained jointly on the fully labeled data and on the pseudo labeling data generated by the teacher model, so as to obtain the corresponding student model.
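As a rough illustration of this teacher stage, the sketch below runs a trained teacher model over the point labeling data and keeps its predicted boxes as pseudo labeling frames; the data-loader format, the model call signature, the output keys and the score threshold are hypothetical and only serve to make the flow concrete.

import torch

@torch.no_grad()
def generate_pseudo_labels(teacher_model, point_labeled_loader, score_thresh=0.5):
    # Sketch of the teacher stage: run the trained teacher over the point
    # labeling data and keep its predicted boxes as pseudo labeling frames.
    # The loader format, model signature, output keys and score threshold are
    # illustrative assumptions, not the implementation of this specification.
    teacher_model.eval()
    pseudo_dataset = []
    for images, point_annotations in point_labeled_loader:
        outputs = teacher_model(images, point_annotations)   # boxes predicted from the point labels
        for image, out in zip(images, outputs):
            keep = out["scores"] > score_thresh
            pseudo_dataset.append({
                "image": image,
                "boxes": out["boxes"][keep],    # pseudo labeling frames
                "labels": out["labels"][keep],  # class labels taken over from the point annotations
            })
    return pseudo_dataset

# The FCOS student is then trained on the union of the fully labeled data and
# this pseudo labeled data.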
The final performance of the resulting student model can reflect the advantages of the teacher model built with the present method. If the performance of the student model meets the requirements, the student model can subsequently be used for specific target detection.
It should be added that, in this scenario example, SP-DETR mainly follows the training settings of DETR, but there are still several differences: in this scenario example, the model is trained for 108 epochs with a batch size of 2 per GPU; the number of denoising groups in the denoising task is set to P = 5, and the number of noised labels M in each group is the total number of labels over all the images in the batch. In addition, to ensure the stability of training, a warm-up training scheme is used at the start of training in this scenario example, and the learning rate is reduced by a factor of 10 at epochs 72 and 96, respectively.
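For illustration, the training schedule described above (108 epochs, a batch size of 2 per GPU, warm-up at the start of training, and learning-rate drops by a factor of 10 at epochs 72 and 96) could be organized roughly as follows; the optimizer choice, base learning rate and warm-up length are assumptions made only for this sketch.

import torch

def train_sp_detr(model, train_loader, epochs=108, base_lr=1e-4):
    # Sketch of the schedule described above: 108 epochs, warm-up at the start
    # of training, learning rate divided by 10 at epochs 72 and 96. The AdamW
    # optimizer, base learning rate and warm-up length are assumptions made
    # only for this sketch.
    optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=1e-4)
    warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.1, total_iters=500)
    decay = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[72, 96], gamma=0.1)

    for epoch in range(epochs):
        for batch in train_loader:          # batch size 2 per GPU
            loss = model(batch)             # assumed to return L_SP-DETR = L_match + L_denoise
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if epoch == 0:
                warmup.step()               # warm-up only at the start of training
        decay.step()                        # epoch-level step decay
    return model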
In addition, based on the method provided in the present specification, the implementation uses Python 3.6.13 for coding, training and verification of the model are performed in the PyTorch deep learning framework, and all experiments are run on hardware with the Ubuntu 20.04.4 operating system and an NVIDIA GeForce GTX Ti GPU with 11 GB of video memory.
Through the above scenario example, it is verified that the target detection method provided by this specification benefits from the fact that, in Point DETR, the point annotation is encoded as an object query that participates in the DETR training process, so the prediction difficulty is low. By contrast, after the point labeling scheme is adapted to FCOS and Faster R-CNN, prediction over the complete feature map is reduced to prediction from point features, so feature information is lost and the prediction difficulty is high. Therefore, the introduced SP-DETR adds a point amplification module, an optimized point encoder, an optimized matching process in the Decoder cross-attention, and a denoising task on the basis of Point DETR, which improves the overall performance of the model. In addition, Group RCNN projects the labeled points onto each level of the FPN feature maps and selects several points near each projected point on the feature maps to generate proposals; this is similar in function to the point amplification module of SP-DETR, but operates on multi-scale features. Its Instance-aware Parameter Generation module is similar to the point encoder of SP-DETR: it computes a set of point features representing the same object and then concatenates them with the class encoding. The point encoder of SP-DETR is more elaborate and can fully mine the point-annotation information, and, together with the optimization of the point-encoding-to-Decoder matching process and the design of the denoising task, it provides a stronger point-to-bounding-box mapping capability.
Although the present description provides method operational steps as described in the examples or flowcharts, more or fewer operational steps may be included based on conventional or non-inventive means. The order of steps recited in the embodiments is merely one way of performing the order of steps and does not represent a unique order of execution. When implemented by an apparatus or client product in practice, the methods illustrated in the embodiments or figures may be performed sequentially or in parallel (e.g., in a parallel processor or multi-threaded processing environment, or even in a distributed data processing environment). The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, it is not excluded that additional identical or equivalent elements may be present in a process, method, article, or apparatus that comprises a described element. The terms first, second, etc. are used to denote a name, but not any particular order.
Those skilled in the art will also appreciate that, in addition to implementing the controller in a pure computer readable program code, it is well possible to implement the same functionality by logically programming the method steps such that the controller is in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, etc. Such a controller can be regarded as a hardware component, and means for implementing various functions included therein can also be regarded as a structure within the hardware component. Or even means for achieving the various functions may be regarded as either software modules implementing the methods or structures within hardware components.
The description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, classes, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer-readable storage media including memory storage devices.
From the above description of embodiments, it will be apparent to those skilled in the art that the present description may be implemented in software plus a necessary general hardware platform. Based on such understanding, the technical solutions of the present specification may be embodied essentially in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and include several instructions to cause a computer device (which may be a personal computer, a mobile terminal, a server, or a network device, etc.) to perform the methods described in the various embodiments or portions of the embodiments of the present specification.
Various embodiments in this specification are described in a progressive manner, and identical or similar parts are all provided for each embodiment, each embodiment focusing on differences from other embodiments. The specification is operational with numerous general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable electronic devices, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
Although the present specification has been described by way of example, it will be appreciated by those skilled in the art that there are many variations and modifications to the specification without departing from the spirit of the specification, and it is intended that the appended claims encompass such variations and modifications as do not depart from the spirit of the specification.

Claims (15)

1. A method of detecting an object, comprising:
acquiring a target image;
processing the target image by using a target detection model to obtain a corresponding target processing result; the target detection model is obtained in advance by weak semi-supervised learning training through a point-marked sample image, a full-marked sample image and a sample transformation model according to a weak semi-supervised learning rule based on point marking; the sample transformation model at least comprises: a point amplification module;
determining whether a target object exists in the target image according to the target processing result; and determining the position information and the object type of the target object in the target image under the condition that the target object exists in the target image.
2. The method of claim 1, wherein prior to acquiring the target image, the method further comprises:
acquiring a point labeling sample image and a full labeling sample image according to a weak semi-supervised learning rule based on point labeling; the number of the point labeling sample images is larger than that of the full labeling sample images; the full-labeling sample image at least comprises a labeling frame for selecting a sample object and a type label carried by the labeling frame and aiming at the sample object; the point labeling sample image at least comprises a labeling point used for indicating a sample object and a type label carried by the labeling point and aiming at the sample object;
processing the point labeling sample image by using a sample transformation model to obtain a processed point labeling sample image carrying a pseudo labeling frame;
combining the full-label sample image and the processed point label sample image to obtain a training sample set;
and training the initial detection model by using the training sample set to obtain a target detection model meeting the requirements.
3. The method of claim 2, wherein the sample transformation model further comprises: a feature extraction network, a transducer module, a point encoder and a prediction head module;
wherein the feature extraction network is connected with the transducer module; the point amplification module is also connected with a point encoder; the point encoder is also connected with the transducer module; the transducer module is also connected with the prediction head module;
the prediction head module is used for determining and marking a pseudo marking frame for framing a sample object in a point marking sample image of an input sample transformation model based on intermediate result data output by the transducer module so as to output a corresponding processed point marking sample image carrying the pseudo marking frame.
4. The method according to claim 3, wherein the feature extraction network is configured to perform feature extraction on a point labeling sample image input into the sample transformation model, so as to obtain a corresponding sample feature map; the sample feature map is sent to a transducer module;
The point amplification module is used for receiving the high-dimensional characteristics of the point labeling sample image output by the feature extraction network; acquiring, in the high-dimensional characteristics of the point labeling sample image, high-dimensional characteristic information at the labeling point through bilinear interpolation sampling; determining, according to the high-dimensional characteristic information at the labeling point, four auxiliary points which are associated with the labeling point and positioned in four different directions through position deviation prediction; and setting the type label carried by the labeling point on each of the four auxiliary points respectively to obtain the amplified pseudo labeling points.
5. A method according to claim 3, wherein the point encoder comprises: an improved point encoder; wherein the improved point encoder comprises at least: an absolute position branch processing structure, a relative position branch processing structure, a category index branch processing structure, and a semantic alignment branch processing structure;
the point encoder is used for marking points in the sample image through the processing points and obtaining and outputting corresponding joint codes through the amplified pseudo marking points.
6. The method of claim 5, wherein the point encoder processes the labeling points in the point labeling sample image and the amplified pseudo labeling points to obtain and output the corresponding joint codes in the following manner:
respectively carrying out preset coding processing according to the position coordinates of the labeling point and the position coordinates of the amplified pseudo labeling points through an absolute position branch processing structure to obtain corresponding absolute position codes;
determining, through a relative position branch processing structure, the offset distances of the labeling point and of the amplified pseudo labeling points relative to the upper boundary, the lower boundary, the left boundary and the right boundary of the point labeling sample image according to the position coordinates of the labeling point and the position coordinates of the amplified pseudo labeling points, and generating corresponding relative position vectors; converting the relative position vectors into corresponding relative ratio data; carrying out preset coding processing on the relative ratio data to obtain corresponding relative position codes;
processing the type label of the labeling point and the type labels of the amplified pseudo labeling points through a category index branch processing structure to obtain a corresponding category index code;
respectively carrying out linear mapping on the obtained high-dimensional characteristics of the labeling point and the high-dimensional characteristics of the amplified pseudo labeling points through a semantic alignment branch processing structure to obtain corresponding image semantic feature codes; the image semantic feature codes, the absolute position codes, the relative position codes and the category index codes have the same dimension;
and combining the absolute position code, the relative position code, the category index code and the image semantic feature code to obtain corresponding joint codes.
7. The method of claim 6, wherein combining the absolute position code, the relative position code, the category index code and the image semantic feature code to obtain the corresponding joint code comprises:
performing addition operation by using the absolute position code, the relative position code and the category index code to obtain an intermediate code;
and multiplying the intermediate code by the image semantic feature code to obtain the joint code.
8. The method of claim 6, wherein the transducer module comprises: a transducer encoder and a transducer decoder; wherein the transducer encoder is respectively connected with the feature extraction network and the transducer decoder; the transducer decoder is also connected with a point encoder and a prediction head module;
the transducer module is used for generating and outputting corresponding intermediate result data according to the input sample characteristic diagram, the position code and the joint code.
9. The method of claim 8, wherein the transducer module generates and outputs the corresponding intermediate result data based on the input sample feature map, the position code and the joint code in the following manner:
Processing the combination of the sample feature map and the position code by a transducer encoder to generate corresponding embedded data; and transmitting the embedded data to a transducer decoder;
obtaining a matched attention mask by a transducer decoder;
generating corresponding object query data by using the joint codes through a transducer decoder; and generating and outputting corresponding intermediate result data by using the attention mask, the object query data and the embedded data.
10. The method of claim 9, wherein the transducer decoder comprises: a decoding layer; wherein, the decoding layer includes: a self-attention layer, a cross-attention layer, and a feedforward neural network layer;
the self-attention layer is used for processing the object query data by using an attention mask so as to strengthen interaction between the object query data indicating the same sample object;
the cross attention layer is used for refining the processed object query data output by the self attention layer by utilizing the embedded data so as to obtain refined object query data;
and the feedforward neural network layer is used for predicting a bounding box containing the sample object in the point labeling sample image according to the refined object query data to obtain corresponding intermediate result data.
11. The method of claim 9, wherein the sample transformation model is further coupled with a matching loss module; the matching loss module is configured with a matching loss function and a denoising loss function.
12. The method of claim 11, wherein generating corresponding object query data using the joint encoding by a transducer decoder comprises:
acquiring noise data based on type tags;
generating corresponding matching data according to the object query data;
combining the noise data and the matching data to obtain the object query data;
accordingly, the matched attention mask includes: a denoising mask region corresponding to the noise data, and a matching mask region corresponding to the matching data.
13. The method of claim 2, wherein the sample transformation model is trained in the following manner:
acquiring a full labeling sample image according to the weak semi-supervised learning rule based on point labeling, and generating first sample data by using the full labeling sample image;
constructing an initial transformation model; wherein the initial transformation model comprises at least an initial point amplification module;
and training the initial transformation model by using the first sample data to obtain a sample transformation model meeting the requirements.
14. An object detection apparatus, comprising:
the acquisition module is used for acquiring a target image;
the processing module is used for processing the target image by utilizing a target detection model to obtain a corresponding target processing result; the target detection model is obtained in advance by weak semi-supervised learning training through a point-marked sample image, a full-marked sample image and a sample transformation model according to a weak semi-supervised learning rule based on point marking; the sample transformation model at least comprises: a point amplification module;
the determining module is used for determining whether a target object exists in the target image according to the target processing result; and determining the position information and the object type of the target object in the target image under the condition that the target object exists in the target image.
15. A server comprising a processor and a memory for storing processor-executable instructions, which when executed by the processor implement the steps of the method of any one of claims 1 to 13.
CN202311061664.4A 2023-08-22 2023-08-22 Target detection method, device and server Pending CN117115584A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311061664.4A CN117115584A (en) 2023-08-22 2023-08-22 Target detection method, device and server

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311061664.4A CN117115584A (en) 2023-08-22 2023-08-22 Target detection method, device and server

Publications (1)

Publication Number Publication Date
CN117115584A true CN117115584A (en) 2023-11-24

Family

ID=88808590

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311061664.4A Pending CN117115584A (en) 2023-08-22 2023-08-22 Target detection method, device and server

Country Status (1)

Country Link
CN (1) CN117115584A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117953206A (en) * 2024-03-25 2024-04-30 厦门大学 Mixed supervision target detection method and device based on point labeling guidance

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination