CN114580548A - Training method of target detection model, target detection method and device

Training method of target detection model, target detection method and device

Info

Publication number
CN114580548A
Authority
CN
China
Prior art keywords
detection
sub
module
detection module
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210222886.9A
Other languages
Chinese (zh)
Inventor
陶大程 (Tao Dacheng)
王文 (Wang Wen)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jingdong Technology Information Technology Co Ltd
Original Assignee
Jingdong Technology Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jingdong Technology Information Technology Co Ltd filed Critical Jingdong Technology Information Technology Co Ltd
Priority to CN202210222886.9A priority Critical patent/CN114580548A/en
Publication of CN114580548A publication Critical patent/CN114580548A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention discloses a training method of a target detection model, a target detection method and a target detection device. The method comprises the following steps: inputting sample data into a target detection model to be trained to obtain the detection results output by each sub-detection module in the target detection model, wherein, in the process of processing the sample data by the target detection model to be trained, a feature extraction module extracts initial image features and inputs them into a detection module; each sub-detection module in the detection module determines its detection result for the input image features, the initial image features are sparsely sampled based on that detection result, and the sampled features are input into the next sub-detection module as its input image features; and determining a loss function based on the detection results output by each sub-detection module and the labeled data of the sample data, and adjusting the network parameters in the target detection model based on the loss function. The method and the device reduce the dependence on a large amount of labeled sample data, saving time cost and labor cost.

Description

Training method of target detection model, target detection method and device
Technical Field
The embodiment of the invention relates to the technical field of deep learning, in particular to a training method of a target detection model, a target detection method and a target detection device.
Background
Object detection is one of the fundamental problems of computer vision; it aims at locating and classifying instances of objects of interest. Object detection has wide application value: for example, it can be used for pedestrian detection in security, for detecting defective parts in industrial scenes, and for detecting geographical areas of interest in remote sensing systems.
At present, target detection can be realized through a machine learning model, but such a model depends on a large amount of labeled data, which is difficult to obtain and requires considerable labor and time cost, particularly when the amount of training data is large.
Disclosure of Invention
The invention provides a training method of a target detection model, a target detection method and a device, which aim to reduce the dependence on a large amount of sample data in the training process of the target detection model.
According to an aspect of the present invention, there is provided a training method of a target detection model, including:
acquiring training sample data, performing the following iterative training on a target detection model to be trained based on the training sample data, and obtaining the trained target detection model under the condition that a training end condition is met, wherein the target detection model comprises a feature extraction module and a detection module, and the detection module comprises a plurality of sub-detection modules which are connected in sequence:
inputting the sample data into the target detection model to be trained to obtain detection results output by each sub-detection module in the target detection model, wherein in the process of processing the sample data by the target detection model to be trained, a feature extraction module extracts initial image features and inputs the initial image features into the detection module, the sub-detection module in the detection module is used for determining the detection results of the sub-detection module for the input image features, the initial image features are subjected to sparse sampling based on the detection results of the sub-detection module, and the sampling features are input into the next sub-detection module as the input image features;
and determining a loss function based on the detection result output by each sub-detection module and the labeled data of the sample data, and adjusting the network parameters in the target detection model based on the loss function.
According to another aspect of the present invention, there is provided an object detection method including:
acquiring an image to be detected, and inputting the image to be detected into a pre-trained target detection model to obtain a target position and a target classification in the image to be detected;
wherein the target detection model comprises a feature extraction module and a detection module, and the detection module comprises a plurality of sub-detection modules which are connected in sequence; the feature extraction module extracts initial image features and inputs them into the detection module, each sub-detection module in the detection module is used for determining the detection result of the sub-detection module for the input image features, the initial image features are sparsely sampled based on the detection result of the sub-detection module, and the sampled features are input into the next sub-detection module as the input image features, until the terminal sub-detection module outputs the target position and the target classification in the image to be detected.
According to another aspect of the present invention, there is provided a training apparatus for an object detection model, configured to:
acquiring training sample data, performing iterative training on a target detection model to be trained based on the training sample data, and obtaining a trained target detection model under the condition that a training end condition is met, wherein the target detection model comprises a feature extraction module and a detection module, and the detection module comprises a plurality of sub-detection modules which are connected in sequence;
the device includes:
the detection result determining module is used for inputting the sample data into the target detection model to be trained to obtain detection results output by each sub-detection module in the target detection model, wherein in the process of processing the sample data by the target detection model to be trained, the feature extracting module extracts initial image features and inputs the initial image features into the detection module, the sub-detection modules in the detection module are used for determining detection results of the sub-detection modules for input image features, the initial image features are subjected to sparse sampling based on the detection results of the sub-detection modules, and sampling features are input into the next sub-detection module as input image features;
the loss function determining module is used for determining a loss function based on the detection result output by each sub-detection module and the labeled data of the sample data;
and the network parameter adjusting module is used for adjusting the network parameters in the target detection model based on the loss function.
According to another aspect of the present invention, there is provided an object detecting apparatus including:
the image acquisition module is used for acquiring an image to be detected;
the target detection module is used for inputting the image to be detected into a pre-trained target detection model to obtain the target position and the target classification in the image to be detected;
wherein the target detection model comprises a feature extraction module and a detection module, and the detection module comprises a plurality of sub-detection modules which are connected in sequence; the feature extraction module extracts initial image features and inputs them into the detection module, each sub-detection module in the detection module is used for determining the detection result of the sub-detection module for the input image features, the initial image features are sparsely sampled based on the detection result of the sub-detection module, and the sampled features are input into the next sub-detection module as the input image features, until the terminal sub-detection module outputs the target position and the target classification in the image to be detected.
According to another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform a method of training an object detection model according to any of the embodiments of the invention, and/or the object detection method.
According to another aspect of the present invention, there is provided a computer-readable storage medium storing computer instructions for causing a processor to implement a method for training an object detection model according to any one of the embodiments of the present invention, and/or the object detection method when the processor executes the method.
According to the technical scheme of this embodiment, in the training process of the target detection model, the initial image features are sparsely sampled between the sub-detection modules of the detection module based on the detection result of the previous sub-detection module, yielding the sparse features processed by the next sub-detection module. Determining the sparse features directly in this way replaces the process of learning sparse features from a large number of dense features, which simplifies the learning process of the model, reduces the redundant features the model must learn, and improves learning efficiency; it thereby reduces the dependence on a large amount of labeled sample data, lowers the amount of sample data that must be labeled, and saves the time cost and labor cost of labeling sample data.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present invention, nor do they necessarily limit the scope of the invention. Other features of the present invention will become apparent from the following description.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic flowchart of a training method of a target detection model according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a target detection model according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of a target detection model provided in an embodiment of the present invention during a training process;
FIG. 4 is a schematic diagram of an efficient attention mechanism based on sparse features provided by an embodiment of the present invention;
fig. 5 is a schematic flowchart of a target detection method according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a training apparatus for a target detection model according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an object detection apparatus according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Fig. 1 is a schematic flow diagram of a training method for a target detection model according to an embodiment of the present invention, which is applicable to a case where a target detection model is trained quickly based on a small amount of sample data, where the method may be executed by a training apparatus for a target detection model according to an embodiment of the present invention, the training apparatus for a target detection model may be implemented by software and/or hardware, and the training apparatus for a target detection model may be configured on an electronic computing device, and specifically includes the following steps:
s110, inputting the sample data into the target detection model to be trained to obtain detection results output by each sub-detection module in the target detection model.
S120, determining a loss function based on the detection result output by each sub-detection module and the labeled data of the sample data, and adjusting the network parameters in the target detection model based on the loss function. And under the condition that the training end condition is met, obtaining the trained target detection model, and under the condition that the training end condition is not met, returning to the step S110 for carrying out the next iterative training.
In this embodiment, the trained object detection model has a function of performing object detection on image frames in an image or video, where the type of the detected object may be determined according to business requirements, and the detected object may include, but is not limited to, pedestrians, vehicles, abnormal or defective products, a region of interest of a focus in a medical image, and the like. Correspondingly, the target detection model with the detection function is obtained by training based on sample data in the corresponding service scene.
The target detection model comprises a feature extraction module and a detection module, wherein the detection module comprises a plurality of sub-detection modules which are connected in sequence. The number of the sub-detection modules in the detection module is not limited in this respect, and may be set according to detection requirements, and in some embodiments, the detection module may include 6 sub-detection modules. It should be noted that, in this embodiment, specific structures of the feature extraction module and the detection module are not limited, and for example, the feature extraction module may be a convolutional network structure, may also be an encoder, may also be a combined structure of the convolutional network structure and the encoder, and the like; meanwhile, the structure of each sub-detection module in the detection module is not limited, and the structures of the sub-detection modules can be the same or different, so that the detection result can be identified.
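As an illustration of this modular structure, the following is a minimal sketch assuming a PyTorch implementation: a convolutional basic feature extractor, a Transformer-style encoder, and six sequentially connected sub-detection modules, each with its own detection heads. The class names, dimensions, and head layout are illustrative assumptions, not the patent's reference implementation; the cascaded forward pass with sparse sampling is sketched after the flow description below.

```python
import torch
import torch.nn as nn

class SubDetectionModule(nn.Module):
    """One sub-detection module: refines object queries against the input
    image features and emits a detection result (boxes + class logits)."""
    def __init__(self, dim: int = 256, num_classes: int = 80):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.box_head = nn.Linear(dim, 4)            # detection frame (cx, cy, w, h)
        self.cls_head = nn.Linear(dim, num_classes)  # classification logits per query

    def forward(self, queries, image_feats):
        q = queries + self.self_attn(queries, queries, queries)[0]
        q = q + self.cross_attn(q, image_feats, image_feats)[0]
        q = q + self.ffn(q)
        return q, self.box_head(q).sigmoid(), self.cls_head(q)

class TargetDetectionModel(nn.Module):
    """Feature extraction module (basic extractor + encoder) followed by a
    detection module of sequentially connected sub-detection modules."""
    def __init__(self, dim: int = 256, num_modules: int = 6, num_queries: int = 100):
        super().__init__()
        self.basic_extractor = nn.Sequential(        # stand-in CNN backbone
            nn.Conv2d(3, dim, kernel_size=7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=6)
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.sub_modules = nn.ModuleList(
            [SubDetectionModule(dim) for _ in range(num_modules)])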
Sample data in the current service scene is obtained. Illustratively, if the service scene is pedestrian identification, the sample data can be images including pedestrians, or image frames in a surveillance video collected by a surveillance camera; correspondingly, the labeled data of the sample data can be an image including a pedestrian detection frame or a pedestrian mark, wherein the pedestrian detection frame can be the minimum bounding box of the pedestrian, and the pedestrian mark can be a mark set on the region where the pedestrian is located. If the service scene is defective-product identification, the sample data can be product images, and correspondingly the labeled data of the sample data can be an image including a defective-product detection frame or a defective-product mark.
And performing iterative training on the target detection model to be trained based on the sample data, and adjusting the network parameters of the target detection model in each iterative training until the training end condition is met to obtain the trained target detection model. Wherein the training end condition may be one or more of the following: reaching the preset training times, reaching the preset detection precision and reaching the convergence of the training loss.
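A hedged sketch of such an iterative training loop with the three end conditions follows; `compute_loss` and `evaluate` are hypothetical helpers standing in for the per-sub-module loss described later and for a validation-precision measurement.

```python
def train(model, loader, optimizer, max_epochs=50, target_precision=None, patience=10):
    """Iterate until any end condition holds: preset number of training epochs,
    preset detection precision, or convergence of the training loss."""
    best_loss, stale = float("inf"), 0
    for epoch in range(max_epochs):                       # end condition 1: epoch budget
        epoch_loss = 0.0
        for images, targets in loader:
            loss = compute_loss(model(images), targets)   # hypothetical per-module loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        if target_precision is not None and evaluate(model) >= target_precision:
            break                                         # end condition 2: precision reached
        stale = stale + 1 if epoch_loss >= best_loss - 1e-6 else 0
        best_loss = min(best_loss, epoch_loss)
        if stale >= patience:
            break                                         # end condition 3: loss has converged
    return model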
In each iterative training, sample data is input into the target detection model of the current iteration to obtain the detection results output by each sub-detection module in the detection module. In each iterative training, during the processing of the sample data by the target detection model, the training sample is input to the feature extraction module, the feature extraction module extracts the initial image features of the sample data, and the initial image features are input to the detection module. In some embodiments, the feature extraction module includes a basic feature extractor for extracting two-dimensional basic features of the sample data, and an encoder for performing feature extraction on the two-dimensional basic features and outputting one-dimensional image sequence features. Illustratively, the target detection model is composed of a convolutional neural network structure and a Transformer network structure; correspondingly, the basic feature extractor can be a convolutional neural network (CNN) structure, the encoder can be the encoder portion of the Transformer, and the detection module the decoder portion of the Transformer.
The feature extraction module inputs the extracted initial image features, namely the one-dimensional image sequence features, into the first sub-detection module in the detection module, and the first sub-detection module processes the initial image features to obtain its detection result. Sparse sampling is then performed on the initial image features based on the detection result of the first sub-detection module, and the sampled features are input into the second sub-detection module as its input image features; the second sub-detection module processes its input image features to obtain its detection result, the initial image features are again sparsely sampled based on the detection result of the second sub-detection module, and the sampled features are input into the third sub-detection module as its input image features; and so on, until the last sub-detection module outputs the detection result for the sample image.
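The cascaded flow just described can be sketched as follows; `sparse_sample_by_boxes` is a hypothetical function standing in for the box-driven sparse sampling discussed later, and each sub-detection module is assumed to return updated queries, boxes, and class logits.

```python
def cascaded_detection(initial_feats, queries, sub_modules, sparse_sample_by_boxes):
    """Run the sub-detection modules in sequence: each module's detection
    result drives sparse sampling of the *initial* image features, and the
    sampled features become the next module's input image features."""
    results = []
    feats = initial_feats                # the first sub-detection module sees the dense features
    for module in sub_modules:
        queries, boxes, logits = module(queries, feats)
        results.append((boxes, logits))  # every module's detection result is kept for the loss
        feats = sparse_sample_by_boxes(initial_feats, boxes)  # always resample the initial features
    return results                       # one (boxes, logits) pair per sub-detection module
```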
In the training process of the target detection model, sparse features related to a specific object example to be detected need to be aggregated from dense image features, so that a large amount of sample data and labeled data are needed to learn the process, and the training process of the target detection model depends on a large amount of sample data and labeled data. In this embodiment, the initial image features are sparsely sampled according to the detection result determined by each sub-detection module for the input image features, the sampling features are input to the next sub-detection module as the input image features, and the sparsely sampled input image features of the next sub-detection module are determined in a plug-and-play manner, so that redundant features in the input image features of the next sub-detection module are reduced, the amount of processed data is reduced, the processing pertinence and accuracy of the next sub-detection module are improved, the learning difficulty of the target detection model is reduced, and the processing efficiency and the learning efficiency of the target detection model are improved.
The structure of each sub-detection module and the manner in which the sub-detection module processes the input image features are not limited herein. In some embodiments, the sub-detection module may be a convolutional network structure that performs convolution processing on the input image features; in some embodiments, the sub-detection module may be a decoding layer in a Transformer network structure, configured to decode the input image features.
In this embodiment, the sampling manner of the sparse sampling is not limited; examples may include, but are not limited to, RoI Align sampling (i.e., bilinear interpolation is used to extract the features of a region-of-interest box), grid sampling, and the like.
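For example, RoI Align as provided by torchvision can realize this kind of box-driven sparse sampling; the feature-map size, image size, and box coordinates below are illustrative assumptions.

```python
import torch
from torchvision.ops import roi_align

feats = torch.randn(2, 256, 32, 32)  # (B, C, H, W) initial image features for a 256x256 input
boxes = [torch.tensor([[10., 10., 80., 120.]]),   # per-image detection frames,
         torch.tensor([[0., 0., 64., 64.]])]      # (x1, y1, x2, y2) in image coordinates

# 7x7 bilinearly interpolated samples inside each detection frame;
# spatial_scale maps image coordinates onto the downsampled feature map.
sampled = roi_align(feats, boxes, output_size=(7, 7), spatial_scale=32 / 256, aligned=True)
print(sampled.shape)  # (total_boxes, 256, 7, 7) -> can be flattened to a short token sequence
```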
In some embodiments, the sparse sampling process may be performed by a processor other than the target detection model, for example, each sub-detection module outputs the detection result to the processor, the processor sparsely samples the initial image feature based on the received detection result, and inputs the sparsely sampled sampling feature to the next sub-detection module.
In some embodiments, each sub-detection module in the detection module except the last one is provided with a sparse sampling unit, which is configured to perform sparse sampling on the initial image features based on the detection result of the current sub-detection module and transmit the sampled features to the next sub-detection module.
In some embodiments, the target detection model further includes a sparse sampling module, which is independent of each sub-detection module, and is connected to each sub-detection module, respectively, and configured to receive a detection result of a previous sub-detection module, perform sparse sampling on the initial image features based on the detection result, and input sampled features to a next sub-detection module. Exemplarily, referring to fig. 2, fig. 2 is a schematic structural diagram of an object detection model provided in an embodiment of the present invention. By arranging the sparse sampling module in the target detection model, sparse sampling is realized in the characteristic transmission process among all the sub-detection modules, and meanwhile, secondary development of all module structures in the target detection model is not required, so that the secondary development cost is avoided.
On the basis of the above embodiment, the detection result includes a detection frame and a classification probability, wherein the detection frame is used to frame the target object in the sample image, that is, the target object in the sample image is located inside the detection frame, and there may be at least one detection frame in the detection result. The classification probability is the probability value of the type to which the target object in the detection frame belongs, and may be, for example, the probability value for each candidate type, output in vector form. Each sub-detection module outputs its current detection result, and the detection results output by different sub-detection modules may differ.
Correspondingly, sparsely sampling the initial image features based on the detection result includes: respectively performing sparse sampling on the initial image features and on the position-encoded data of the feature extraction module based on the detection frame, to obtain image sampling features and encoded sampling features. In this embodiment, each sub-detection module is a decoder layer of the Transformer decoder; accordingly, the input information of each decoder layer includes the input image features and the position-encoded data, and the sparsely sampled input image features and the sparsely sampled position-encoded data are obtained through sparse sampling, so that the next decoder layer can conveniently process sparse features, reducing redundancy and improving processing efficiency. Meanwhile, sparse sampling is driven by the detection frame in the detection result of the previous decoder layer: specifically, a sampling range is determined in the initial image features and the position-encoded data based on the detection frame, and sparse sampling is performed within this range. This reduces the interference of redundant features outside the detection frame, improves the accuracy of the sampled features while reducing the data volume, cuts down the amount of redundant data to be learned, and lowers the learning difficulty of the model.
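A sketch of this paired sampling, under the assumption that the initial image features and the position-encoded data are kept (or reshaped) as 2-D maps so that the same detection frames can index both:

```python
import torch
from torchvision.ops import roi_align

def sparse_sample(initial_feats, pos_encoding, boxes, out_size=(4, 4), scale=1.0):
    """Sample the initial image features and the position-encoded data inside
    the same detection frames, so each image sampling feature keeps a matching
    encoded sampling feature. Both inputs are assumed to be (B, C, H, W) maps."""
    img = roi_align(initial_feats, boxes, out_size, spatial_scale=scale, aligned=True)
    pos = roi_align(pos_encoding, boxes, out_size, spatial_scale=scale, aligned=True)
    # (num_boxes, C, 4, 4) -> (num_boxes, 16, C): short sparse token sequences
    flatten = lambda t: t.flatten(2).transpose(1, 2)
    return flatten(img), flatten(pos)
```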
On the basis of the above embodiment, the sub-detection module comprises a self-attention unit, a cross-attention unit, a feed-forward network unit and a detection unit; wherein the self-attention unit is used for extracting self-attention features of the current decoding sequence features; the cross attention unit is used for extracting cross attention features from the self attention features and the input image features based on a feature mapping mode; the feedforward network unit is used for obtaining an updated decoding sequence characteristic based on the cross attention characteristic; the detection unit is used for determining the current detection result based on the updated decoding sequence characteristic.
When the detection module is a Transformer decoder, each element in the decoder's decoded sequence features is an Object Content Embedding, updated by each decoder layer. For each decoder layer, the current decoded sequence features are obtained, and the Self-attention unit in the decoder layer extracts the self-attention features of the current decoded sequence features and inputs them into the Cross-attention unit. The cross attention unit in this embodiment replaces the quadratic-complexity attention mechanism with an efficient linear-complexity attention mechanism, that is, the cross attention feature is extracted from the self-attention features and the input image features based on a feature mapping manner, so as to reduce the computational complexity; specifically, the feature mapping is implemented by a linear mapping function or a nonlinear mapping function. It should be noted that the efficient linear-complexity attention mechanism adopted by the cross attention unit in this embodiment depends on the sparsely sampled input features. If the decoder processed dense features, the attention matrix obtained by the cross attention unit would correspondingly be a high-dimensional matrix, which would require the linear mapping function Linear to predict a high-dimensional attention matrix from a single object content query embedding; on the one hand this significantly increases the number of model parameters, and on the other hand the model becomes prone to over-fitting and difficult to train. The input features in the present application are the sparse features after sparse sampling, which avoids this technical problem.
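A minimal sketch of such a linear-complexity cross attention unit follows (the softmax placement and single-head layout are assumptions). Note that the linear layer's output width equals the number of image tokens, which is exactly why the mechanism stays small and trainable only when the input features have been sparsely sampled.

```python
import torch
import torch.nn as nn

class LinearCrossAttention(nn.Module):
    """Cross attention whose attention matrix is predicted directly from the
    object content embeddings by a linear map, instead of query-key dot
    products; the cost is linear in the number of (sparse) image tokens."""
    def __init__(self, dim: int, num_tokens: int):
        super().__init__()
        self.to_attn = nn.Linear(dim, num_tokens)  # one weight per sparse image token
        self.value = nn.Linear(dim, dim)

    def forward(self, queries, sparse_feats):
        # queries: (B, M, D); sparse_feats: (B, K, D), with K == num_tokens small
        attn = self.to_attn(queries).softmax(dim=-1)  # (B, M, K) predicted attention matrix
        return attn @ self.value(sparse_feats)        # (B, M, D) aggregated sparse features
```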
The feedforward network unit may include a plurality of network layers, and may be, for example, a fully-connected layer including a nonlinear activation function, which is used to perform nonlinear transformation on a feature updated by the cross attention mechanism, and specifically, may perform nonlinear transformation on a decoder sequence in which image features after sparse sampling are aggregated, to obtain an updated decoded sequence feature. The updated decoding sequence characteristic may be used as a decoding sequence characteristic for a next decoder layer.
The detection unit may include a full connection layer, and is configured to detect the target object for the updated decoded sequence feature and output a detection result.
And under the condition that the detection result output by each sub-detection module is obtained, determining a loss function based on the detection result output by each sub-detection module and the labeled data of the sample data, wherein each detection result comprises a detection position and a detection classification probability, and the detection position can be determined through a detection frame. The annotation data of the sample data may include a standard position and a classification type of the target object, wherein the standard position of the target object may be determined by an external frame of the target object in the sample data.
Optionally, determining the loss function includes: determining a loss term based on the detection result output by each sub-detection module and the labeled data of the sample data; and determining the loss function based on the loss terms corresponding to the sub-detection modules. In this embodiment, the type of loss function used to determine the loss terms is not limited; in some embodiments, it may include, but is not limited to, at least one of the following: exponential loss functions, hinge loss functions, cross-entropy loss functions, squared loss functions, logarithmic (log) loss functions, and the like. Through any one or more of these loss function types, the loss term corresponding to each sub-detection module is determined based on the detection result output by that sub-detection module and the labeled data of the sample data, and the loss function types corresponding to different loss terms may be the same or different. The sum or weighted sum of all the loss terms is determined as the final loss function for adjusting the network parameters of the target detection model. Specifically, the backward adjustment of the network parameters of the target detection model may be performed based on a gradient descent method.
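A simplified sketch of the summed loss, assuming one cross-entropy term and one L1 box term per sub-detection module; the query-to-ground-truth assignment step that a practical detector requires is deliberately omitted here.

```python
import torch
import torch.nn.functional as F

def detection_loss(outputs, targets, box_weight=5.0):
    """Sum one loss term per sub-detection module's detection result.
    outputs: list of (boxes, logits), one entry per sub-detection module;
    targets: dict with 'boxes' (B, M, 4) and 'labels' (B, M) assumed already
    aligned to the queries (an assumption that hides the matching problem)."""
    total = outputs[0][0].new_zeros(())
    for boxes, logits in outputs:         # a loss term for every sub-detection module
        cls_loss = F.cross_entropy(logits.flatten(0, 1), targets["labels"].flatten())
        box_loss = F.l1_loss(boxes, targets["boxes"])
        total = total + cls_loss + box_weight * box_loss
    return total                          # sum (or weighted sum) of all loss terms
```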
And iteratively executing the training process to obtain a trained target detection model so as to perform target detection on the image to be detected.
According to the technical scheme of this embodiment, in the training process of the target detection model, the initial image features are sparsely sampled between the sub-detection modules of the detection module based on the detection result of the previous sub-detection module, yielding the sparse features processed by the next sub-detection module. Determining the sparse features directly in this way replaces the process of learning sparse features from a large number of dense features, which simplifies the learning process of the model, reduces the redundant features the model must learn, and improves learning efficiency; it thereby reduces the dependence on a large amount of labeled sample data, lowers the amount of sample data that must be labeled, and saves the time cost and labor cost of labeling sample data.
On the basis of the above embodiment, a preferred example of a training method of the target detection model is also provided. In the preferred embodiment, the target detection model is composed of a basic feature extractor of a Convolutional Neural Network (CNN) and a Transformer network structure. The basic feature extractor and the Transformer encoder form a feature extraction module, the Transformer decoder serves as a detection module, and correspondingly, each decoder layer of the Transformer decoder serves as each sub-detection module. For example, referring to fig. 3, fig. 3 is a schematic flowchart of a target detection model in a training process according to an embodiment of the present invention.
The basic feature extractor extracts two-dimensional basic features of the input sample data (e.g., a sample image), which can be expressed as $x \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$ and $W$ denote the channel, height and width dimensions of the feature map. Flattening the two-dimensional basic features into a one-dimensional sequence and performing feature mapping yields $z_0 \in \mathbb{R}^{N \times D}$, having a length of $N = H \times W$; the dimension of the embedded features is $D$. The basic feature extractor inputs the extracted features into the Transformer encoder, which consists of $L_{enc}$ stacked encoder layers. Representing the input features of the Transformer encoder as $z_0$, each encoder layer $\mathrm{EncoderLayer}_l$ takes the output of the previous encoder layer $z_{l-1}$ and the position code $p$ as input and outputs the encoded sequence features $z_l$, namely $z_l = \mathrm{EncoderLayer}_l(z_{l-1}, p)$. The last encoder layer of the Transformer encoder outputs the one-dimensional image sequence features $z_{L_{enc}}$, i.e. the initial image features.

The Transformer decoder consists of $L_{dec}$ stacked decoder layers; the decoder input has an initial sequence $q_0$, and each element in the sequence is also referred to as an Object Content Embedding. Each decoder layer $\mathrm{DecoderLayer}_l$ takes the output sequence features of the previous decoder layer, i.e. the current decoded sequence features $q_{l-1}$, the decoder sequence position coding $p_q$, the initial image features $z$ output by the encoder, and the encoder sequence position encoding $p$ as input, and outputs the decoded sequence features, i.e. the updated decoded sequence features $q_l = \mathrm{DecoderLayer}_l(q_{l-1}, p_q, z, p)$.

Each decoder layer includes a Self-attention unit between the object content embeddings, a Cross-attention unit between the object content embeddings and the image feature sequence, and a feed-forward neural network acting on the object content embeddings. The cross attention unit adopts an efficient attention mechanism with linear complexity; for example, see fig. 4, which is a schematic diagram of the efficient attention mechanism based on sparse features provided by the embodiment of the present invention. On the basis of using sparse image features, the attention matrix is predicted directly from the object content embeddings by a linear mapping layer, namely:

$A_l = \mathrm{Linear}(q_{l-1})$,

where $A_l$ is the attention matrix output by the cross attention unit in decoder layer $l$, i.e. the cross attention feature. The feedforward neural network aggregates the input image features, namely the sparsely sampled image features, weighted linearly by the cross attention features, to obtain the updated decoded sequence features.
Each decoder layer also comprises a detection unit composed of a fully connected layer, which outputs the detection result of the decoder layer, such as a position vector and a category vector, wherein the position vector represents the detection frame, and the category vector comprises the classification probability that the target object belongs to each classification type.
The sparse sampling module in the target detection model performs sparse sampling on the initial image features based on the detection frame in the detection result of each decoder layer, and the sparse image features serve as the input image features of the next decoder layer.
A loss function is determined from the detection result output by each decoder layer and the labeled data of the sample data, so as to adjust the network parameters in the target detection model in a backward pass; the above process is executed iteratively until the trained target detection model is obtained.
Fig. 5 is a schematic flowchart of a target detection method provided in an embodiment of the present invention, where this embodiment is applicable to a case of performing target detection on an image to be detected, and the method may be executed by a target detection apparatus provided in an embodiment of the present invention, where the target detection apparatus may be implemented by software and/or hardware, and the target detection apparatus may be configured on an electronic computing device, and specifically includes the following steps:
and S210, acquiring an image to be detected.
S220, inputting the image to be detected into a pre-trained target detection model to obtain the target position and the target classification in the image to be detected.
The target detection model comprises a feature extraction module and a detection module, and the detection module comprises a plurality of sub-detection modules which are connected in sequence. The feature extraction module extracts initial image features and inputs them into the detection module; each sub-detection module in the detection module is used for determining the detection result of the sub-detection module for the input image features, the initial image features are sparsely sampled based on that detection result, and the sampled features are input into the next sub-detection module as the input image features, until the terminal sub-detection module outputs the target position and the target classification in the image to be detected.
Optionally, the target detection model further includes a sparse sampling module, where the sparse sampling module is connected to each of the sub-detection modules, and is configured to receive a detection result of the previous sub-detection module, perform sparse sampling on the initial image feature based on the detection result, and input the sampling feature to the next sub-detection module.
It should be noted that, in this embodiment, the target detection model is obtained by training based on the training method of the target detection model provided in any of the above embodiments.
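Illustrative inference-time usage under the conventions of the sketches above; the file name, input size, and threshold are assumptions.

```python
import torch

model = torch.load("target_detection_model.pt")  # hypothetical trained model checkpoint
model.eval()

image = torch.rand(1, 3, 256, 256)               # image to be detected (already preprocessed)
with torch.no_grad():
    outputs = model(image)                       # one detection result per sub-detection module
boxes, logits = outputs[-1]                      # the terminal sub-detection module's output
scores, labels = logits.softmax(-1).max(-1)      # target classification per object query
keep = scores > 0.5                              # simple confidence threshold
print(boxes[keep], labels[keep])                 # target positions and target classifications
```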
According to the technical scheme provided by this embodiment, during the processing by the target detection model, the initial image features are sparsely sampled between the sub-detection modules of the detection module based on the detection result of the previous sub-detection module, yielding the sparse features processed by the next sub-detection module. This reduces the redundant features processed by the next sub-detection module, improves the pertinence and accuracy of its input image features, and improves the processing efficiency and detection accuracy of the target detection model.
On the basis of the foregoing embodiment, a training device for a target detection model is further provided, referring to fig. 6, fig. 6 is a schematic structural diagram of a training device for a target detection model according to an embodiment of the present invention, where the training device is configured to obtain training sample data, perform iterative training on a target detection model to be trained based on the training sample, and obtain a trained target detection model when a training end condition is met. The device includes:
a detection result determining module 310, configured to input the sample data to the target detection model to be trained, and obtain a detection result output by each sub-detection module in the target detection model, where the target detection model includes a feature extraction module and a detection module, the detection module includes multiple sub-detection modules connected in sequence, and in a process of processing the sample data by the target detection model to be trained, the feature extraction module extracts an initial image feature and inputs the initial image feature to the detection module, and the sub-detection module in the detection module is configured to determine a detection result of the sub-detection module for an input image feature, perform sparse sampling on the initial image feature based on the detection result of the sub-detection module, and input a sampling feature as an input image feature to a next sub-detection module;
a loss function determining module 320, configured to determine a loss function based on the detection result output by each sub-detection module and the labeled data of the sample data;
a network parameter adjusting module 330, configured to adjust a network parameter in the target detection model based on the loss function.
Optionally, the target detection model further includes a sparse sampling module, where the sparse sampling module is connected to each of the sub-detection modules, and is configured to receive a detection result of the previous sub-detection module, perform sparse sampling on the initial image feature based on the detection result, and input the sampling feature to the next sub-detection module.
Optionally, the detection result includes a detection frame and a classification probability;
the sparse sampling module is to: and respectively carrying out sparse sampling on the initial image characteristics and the position coding data of the characteristic extraction module based on the detection frame to obtain image sampling characteristics and coding sampling characteristics.
Optionally, the sub-detection module includes a self-attention unit, a cross-attention unit, a feed-forward network unit, and a detection unit;
wherein the self-attention unit is used for extracting self-attention features of the current decoding sequence features;
the cross attention unit is used for extracting cross attention features from the self attention features and the input image features based on a linear mapping mode;
the feedforward network unit is used for obtaining an updated decoding sequence characteristic based on the cross attention characteristic;
the detection unit is used for determining the current detection result based on the updated decoding sequence characteristic.
Optionally, the feature extraction module includes a basic feature extractor and an encoder, where the basic feature extractor is configured to extract a two-dimensional basic feature of sample data, and the encoder is configured to perform feature extraction on the two-dimensional basic feature and output a one-dimensional image sequence feature.
Optionally, the loss function determining module 330 is configured to determine a loss term based on the detection result output by each sub-detection module and the labeled data of the sample data; and determining a loss function based on the loss terms corresponding to the sub-detection modules.
The training device for the target detection model provided by the embodiment of the invention can execute the training method for the target detection model provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the training method for executing the target detection model.
On the basis of the above embodiment, there is also provided an object detection apparatus, referring to fig. 7, where fig. 7 is a schematic structural diagram of an object detection apparatus provided in an embodiment of the present invention, the apparatus includes:
an image obtaining module 410, configured to obtain an image to be detected;
the target detection module 420 is configured to input the image to be detected into a pre-trained target detection model, so as to obtain a target position and a target classification in the image to be detected;
the target detection model comprises a feature extraction module and a detection module, the detection module comprises a plurality of sub-detection modules which are connected in sequence, the feature extraction module extracts initial image features and inputs the initial image features to the detection module, the sub-detection modules in the detection module are used for determining the features of input images, detection results of the sub-detection modules are based on the detection results of the sub-detection modules are right, the initial image features are subjected to sparse sampling, sampling features are input to the next sub-detection module as the features of the input images, and the target positions and the targets are classified until the output of the terminal sub-detection modules is ready to detect the target positions and the targets in the images.
The target detection device provided by the embodiment of the invention can execute the target detection method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects for executing the target detection method.
FIG. 8 illustrates a schematic diagram of an electronic device 10 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 8, the electronic device 10 includes at least one processor 11, and a memory communicatively connected to the at least one processor 11, such as a read-only memory (ROM) 12, a random access memory (RAM) 13, and the like, wherein the memory stores a computer program executable by the at least one processor, and the processor 11 may perform various appropriate actions and processes according to the computer program stored in the ROM 12 or the computer program loaded from the storage unit 18 into the RAM 13. In the RAM 13, various programs and data necessary for the operation of the electronic device 10 can also be stored. The processor 11, the ROM 12, and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.
A number of components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, or the like; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
Processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, or the like. The processor 11 performs the various methods and processes described above, such as a training method for an object detection model, and/or the object detection method.
In some embodiments, the method of training the object detection model, and/or the object detection method, may be implemented as a computer program tangibly embodied in a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. The training method of the object detection model described above, and/or one or more steps of the object detection method, may be performed when the computer program is loaded into the RAM 13 and executed by the processor 11. Alternatively, in other embodiments, the processor 11 may be configured by any other suitable means (e.g., by means of firmware) to perform a training method of the object detection model, and/or the object detection method.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for implementing the methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. A computer program can execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine or entirely on a remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical host and VPS service are overcome.
It should be understood that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present invention may be executed in parallel, sequentially, or in a different order; no limitation is imposed herein as long as the desired results of the technical solution of the present invention can be achieved.
The above-described embodiments should not be construed as limiting the scope of the invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
It is to be noted that the foregoing is merely illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, and that various obvious changes, rearrangements, and substitutions may be made without departing from the scope of the invention. Therefore, although the present invention has been described in some detail through the above embodiments, it is not limited to those embodiments and may include other equivalent embodiments without departing from its spirit; the scope of the present invention is determined by the scope of the appended claims.

Claims (11)

1. A method for training a target detection model, comprising:
acquiring training sample data, performing the following iterative training on a target detection model to be trained based on the training sample data, and obtaining the trained target detection model under the condition that a training end condition is met, wherein the target detection model comprises a feature extraction module and a detection module, and the detection module comprises a plurality of sub-detection modules which are connected in sequence:
inputting the sample data into the target detection model to be trained to obtain detection results output by each sub-detection module in the target detection model, wherein, in the process of processing the sample data by the target detection model to be trained, the feature extraction module extracts initial image features and inputs the initial image features into the detection module, and each sub-detection module in the detection module is used for determining the detection result of the sub-detection module for its input image features, sparsely sampling the initial image features based on the detection result of the sub-detection module, and inputting the sampled features into the next sub-detection module as its input image features;
and determining a loss function based on the detection result output by each sub-detection module and the labeled data of the sample data, and adjusting the network parameters in the target detection model based on the loss function.
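For illustration only, the claimed training step can be sketched in PyTorch-style code as follows; every name here (feature_extractor, sub_detectors, sparse_sampler, criterion) is a hypothetical stand-in, not an identifier from the patent:

    import torch

    def train_step(model, optimizer, sample_data, annotations, criterion):
        # Feature extraction module: produces the initial image features.
        # All model attributes below are illustrative assumptions.
        initial_features = model.feature_extractor(sample_data)
        input_features = initial_features
        loss_terms = []
        # Sub-detection modules are connected in sequence.
        for sub_detector in model.sub_detectors:
            detection_result = sub_detector(input_features)
            # Sparse sampling of the *initial* features, driven by this
            # sub-detection module's result; the samples become the next input.
            input_features = model.sparse_sampler(initial_features, detection_result)
            # One loss term per sub-detection module, against the labeled data.
            loss_terms.append(criterion(detection_result, annotations))
        loss = torch.stack(loss_terms).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()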
2. The method according to claim 1, wherein the target detection model further comprises a sparse sampling module connected to each of the sub-detection modules and configured to receive the detection result of a preceding sub-detection module, perform sparse sampling on the initial image features based on that detection result, and input the sampled features to the next sub-detection module.
3. The method of claim 1, wherein the detection result comprises a detection box and a classification probability;
wherein the sparsely sampling the initial image features based on the detection results comprises:
sparsely sampling, based on the detection box, the initial image features and the position encoding data of the feature extraction module, respectively, to obtain image sampling features and encoding sampling features.
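As one hedged reading of this claim, sampling both tensors at detection-box centers could look like the sketch below; the (cx, cy, w, h) box convention and the use of grid_sample are assumptions of this illustration, not requirements of the patent:

    import torch.nn.functional as F

    def sparse_sample(initial_features, position_encoding, boxes):
        # initial_features, position_encoding: (B, C, H, W)
        # boxes: (B, N, 4) as normalized (cx, cy, w, h) in [0, 1] -- assumed layout
        centers = boxes[..., :2] * 2.0 - 1.0          # map to grid_sample's [-1, 1]
        grid = centers.unsqueeze(2)                   # (B, N, 1, 2)
        image_samples = F.grid_sample(initial_features, grid, align_corners=False)
        coding_samples = F.grid_sample(position_encoding, grid, align_corners=False)
        # Each result is (B, C, N): image sampling features and encoding
        # sampling features, one vector per detection box.
        return image_samples.squeeze(-1), coding_samples.squeeze(-1)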
4. The method of claim 1, wherein each sub-detection module comprises a self-attention unit, a cross-attention unit, a feed-forward network unit, and a detection unit;
wherein the self-attention unit is used for extracting self-attention features from the current decoding sequence features;
the cross-attention unit is used for extracting cross-attention features from the self-attention features and the input image features in a feature-mapping manner;
the feed-forward network unit is used for obtaining updated decoding sequence features based on the cross-attention features;
the detection unit is used for determining the current detection result based on the updated decoding sequence features.
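A minimal sketch of one sub-detection module with the four claimed units, written in the style of a transformer decoder layer; the dimensions, the residual-free wiring, and the two detection heads are illustrative assumptions:

    import torch.nn as nn

    class SubDetectionModule(nn.Module):
        # Hypothetical structure; names and sizes are not from the patent.
        def __init__(self, dim=256, heads=8, num_classes=80):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(dim, heads)   # self-attention unit
            self.cross_attn = nn.MultiheadAttention(dim, heads)  # cross-attention unit
            self.ffn = nn.Sequential(                            # feed-forward network unit
                nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
            self.box_head = nn.Linear(dim, 4)                    # detection unit: detection box
            self.cls_head = nn.Linear(dim, num_classes)          # detection unit: classification

        def forward(self, decode_seq, image_features):
            # Self-attention features of the current decoding sequence features.
            sa, _ = self.self_attn(decode_seq, decode_seq, decode_seq)
            # Cross-attention features from the self-attention features (queries)
            # and the input image features (keys/values) -- one feature mapping.
            ca, _ = self.cross_attn(sa, image_features, image_features)
            updated = self.ffn(ca)                       # updated decoding sequence features
            boxes = self.box_head(updated).sigmoid()     # current detection result
            class_probs = self.cls_head(updated).softmax(-1)
            return boxes, class_probs, updated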
5. The method according to claim 1, wherein the feature extraction module comprises a basic feature extractor and an encoder, the basic feature extractor being configured to extract two-dimensional basic features of the sample data, and the encoder being configured to perform feature extraction on the two-dimensional basic features to output one-dimensional image sequence features.
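For concreteness, a hedged sketch of such a feature extraction module, assuming a CNN backbone as the basic feature extractor and a transformer-style encoder; the flattening convention is an assumption of this illustration:

    import torch.nn as nn

    class FeatureExtractionModule(nn.Module):
        # Hypothetical composition; both sub-modules are supplied by the caller.
        def __init__(self, basic_extractor, encoder):
            super().__init__()
            self.basic_extractor = basic_extractor  # e.g. a CNN backbone
            self.encoder = encoder                  # e.g. an nn.TransformerEncoder

        def forward(self, sample_data):
            # Two-dimensional basic features: (B, C, H, W).
            feats_2d = self.basic_extractor(sample_data)
            # Flatten the spatial map into a sequence: (H*W, B, C).
            sequence = feats_2d.flatten(2).permute(2, 0, 1)
            # Encoder output: one-dimensional image sequence features.
            return self.encoder(sequence)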
6. The method according to claim 1, wherein the determining a loss function based on the detection result output by each sub-detection module and the labeled data of the sample data comprises:
determining a loss term based on the detection result output by each sub-detection module and the labeled data of the sample data;
and determining a loss function based on the loss terms corresponding to the sub-detection modules.
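Read literally, the loss composition of this claim reduces to one term per sub-detection module combined into a single objective; a minimal sketch, assuming simple summation:

    def loss_function(detection_results, annotations, criterion):
        # One loss term per sub-detection module's output...
        loss_terms = [criterion(result, annotations) for result in detection_results]
        # ...combined into the overall loss function (summation assumed here).
        return sum(loss_terms)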
7. A method of object detection, comprising:
acquiring an image to be detected, and inputting the image to be detected into a pre-trained target detection model to obtain a target position and a target classification in the image to be detected;
the target detection model comprises a feature extraction module and a detection module, the detection module comprises a plurality of sub-detection modules which are connected in sequence, the feature extraction module extracts initial image features and inputs the initial image features to the detection module, and each sub-detection module in the detection module is used for determining the detection result of the sub-detection module for its input image features, sparsely sampling the initial image features based on the detection result of the sub-detection module, and inputting the sampled features to the next sub-detection module as its input image features, until the terminal sub-detection module outputs the target position and the target classification in the image to be detected.
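An illustrative inference pass matching this claim; the names mirror the hypothetical training sketch above, and the terminal sub-detection module's output is taken as the final prediction:

    import torch

    @torch.no_grad()
    def detect(model, image):
        # model.feature_extractor, model.sub_detectors, and model.sparse_sampler
        # are assumed attributes, as in the earlier training sketch.
        initial_features = model.feature_extractor(image)
        input_features = initial_features
        detection_result = None
        for sub_detector in model.sub_detectors:
            detection_result = sub_detector(input_features)
            input_features = model.sparse_sampler(initial_features, detection_result)
        # Terminal sub-detection module's output: target positions and
        # target classifications in the image to be detected.
        return detection_result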
8. An apparatus for training an object detection model, the apparatus being configured to:
acquiring training sample data, performing iterative training on a target detection model to be trained based on the training sample data, and obtaining the trained target detection model under the condition that a training end condition is met, wherein the target detection model comprises a feature extraction module and a detection module, and the detection module comprises a plurality of sub-detection modules which are connected in sequence;
the device includes:
the detection result determining module is used for inputting the sample data into the target detection model to be trained to obtain detection results output by each sub-detection module in the target detection model, wherein, in the process of processing the sample data by the target detection model to be trained, the feature extraction module extracts initial image features and inputs the initial image features into the detection module, and each sub-detection module in the detection module is used for determining the detection result of the sub-detection module for its input image features, sparsely sampling the initial image features based on the detection result of the sub-detection module, and inputting the sampled features into the next sub-detection module as its input image features;
the loss function determining module is used for determining a loss function based on the detection result output by each sub-detection module and the labeled data of the sample data;
and the network parameter adjusting module is used for adjusting the network parameters in the target detection model based on the loss function.
9. An object detection device, comprising:
the image acquisition module is used for acquiring an image to be detected;
the target detection module is used for inputting the image to be detected into a pre-trained target detection model to obtain the target position and the target classification in the image to be detected;
wherein the target detection model comprises a feature extraction module and a detection module, the detection module comprises a plurality of sub-detection modules which are connected in sequence, the feature extraction module extracts initial image features and inputs the initial image features to the detection module, and each sub-detection module in the detection module is used for determining the detection result of the sub-detection module for its input image features, sparsely sampling the initial image features based on the detection result of the sub-detection module, and inputting the sampled features to the next sub-detection module as its input image features, until the terminal sub-detection module outputs the target position and the target classification in the image to be detected.
10. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform the method of training a target detection model according to any one of claims 1 to 6 and/or the method of target detection according to claim 7.
11. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions which, when executed, cause a processor to implement the method of training a target detection model according to any one of claims 1 to 6 and/or the method of target detection according to claim 7.
CN202210222886.9A 2022-03-09 2022-03-09 Training method of target detection model, target detection method and device Pending CN114580548A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210222886.9A CN114580548A (en) 2022-03-09 2022-03-09 Training method of target detection model, target detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210222886.9A CN114580548A (en) 2022-03-09 2022-03-09 Training method of target detection model, target detection method and device

Publications (1)

Publication Number Publication Date
CN114580548A true CN114580548A (en) 2022-06-03

Family

ID=81779216

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210222886.9A Pending CN114580548A (en) 2022-03-09 2022-03-09 Training method of target detection model, target detection method and device

Country Status (1)

Country Link
CN (1) CN114580548A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117994257A (en) * 2024-04-07 2024-05-07 中国机械总院集团江苏分院有限公司 Fabric flaw analysis and detection system and analysis and detection method based on deep learning


Similar Documents

Publication Publication Date Title
CN112784778B (en) Method, apparatus, device and medium for generating model and identifying age and sex
CN113989593A (en) Image processing method, search method, training method, device, equipment and medium
CN113642583B (en) Deep learning model training method for text detection and text detection method
CN113361578A (en) Training method and device of image processing model, electronic equipment and storage medium
CN113901907A (en) Image-text matching model training method, image-text matching method and device
CN114818708B (en) Key information extraction method, model training method, related device and electronic equipment
CN112966744A (en) Model training method, image processing method, device and electronic equipment
CN113837308A (en) Knowledge distillation-based model training method and device and electronic equipment
CN113436100A (en) Method, apparatus, device, medium and product for repairing video
CN113902010A (en) Training method of classification model, image classification method, device, equipment and medium
CN115311469A (en) Image labeling method, training method, image processing method and electronic equipment
CN115631381A (en) Classification model training method, image classification device and electronic equipment
CN114429633A (en) Text recognition method, model training method, device, electronic equipment and medium
CN112966140B (en) Field identification method, field identification device, electronic device, storage medium and program product
CN114580548A (en) Training method of target detection model, target detection method and device
CN114495101A (en) Text detection method, and training method and device of text detection network
CN114691918B (en) Radar image retrieval method and device based on artificial intelligence and electronic equipment
CN116092101A (en) Training method, image recognition method apparatus, device, and readable storage medium
CN116012363A (en) Substation disconnecting link opening and closing recognition method, device, equipment and storage medium
CN113989569B (en) Image processing method, device, electronic equipment and storage medium
CN115861809A (en) Rod detection and training method and device for model thereof, electronic equipment and medium
CN115116080A (en) Table analysis method and device, electronic equipment and storage medium
CN114707638A (en) Model training method, model training device, object recognition method, object recognition device, object recognition medium and product
CN115130473A (en) Key information extraction method, model training method, related device and electronic equipment
CN114610953A (en) Data classification method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination