CN113762051B - Model training method, image detection device, storage medium and equipment - Google Patents

Model training method, image detection device, storage medium and equipment

Info

Publication number
CN113762051B
Authority
CN
China
Prior art keywords
detection
training
model
image sample
image
Prior art date
Legal status
Active
Application number
CN202110523328.1A
Other languages
Chinese (zh)
Other versions
CN113762051A
Inventor
诸加丹
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Co., Ltd.
Priority to CN202110523328.1A
Publication of CN113762051A
Application granted
Publication of CN113762051B


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention discloses a model training method, an image detection device, a storage medium and equipment. A first feature map is obtained by processing an image sample with a first detection model, and a second feature map is obtained by processing the image sample with a second detection model; a first training loss parameter corresponding to the positions of objects contained in the image sample is calculated based on the first feature map and the second feature map; the first training loss parameter is adjusted according to the guide training area position of the image sample to obtain a second training loss parameter corresponding to the image sample; and the network parameters of the second detection model are adjusted according to the second training loss parameter to obtain an adjusted second detection model. By applying machine learning in this way, the second detection model learns from the first detection model at the guide training area position, so that the detection performance of the second detection model can be improved in a targeted manner, and its detection accuracy is further improved.

Description

Model training method, image detection device, storage medium and equipment
Technical Field
The invention relates to the technical field of computers, in particular to a model training method, an image detection device, a storage medium and equipment.
Background
Machine Learning (ML) is a multi-disciplinary field that can be applied to model training. Specifically, machine learning can learn from sample data and update the model according to the learning result to obtain a model with improved performance.
In actual use, each model is intended to serve a particular application and perform a particular function. Models deployed in applications are generally expected to use fewer computing resources (memory space, computing units, etc.) and to incur lower latency, which is why knowledge distillation methods for models were developed. Knowledge distillation is a model compression and training method in which a student model (a small model) improves its performance by mimicking a teacher model (a large model).
However, for detection models, the vast majority of the training data consists of background features while target features account for only a small proportion, so directly optimizing the student model with a knowledge distillation method yields poor results.
Disclosure of Invention
The embodiment of the application provides a model training method, an image detection device, a storage medium and equipment.
The first aspect of the present application provides a model training method, including:
Acquiring a first feature map obtained by processing an image sample through a first detection model, and acquiring a second feature map obtained by processing the image sample through a second detection model;
Calculating a first training loss parameter corresponding to the object position contained in the image sample based on the first feature map and the second feature map;
adjusting the first training loss parameters according to the position of the guiding training area of the image sample to obtain second training loss parameters corresponding to the image sample;
and adjusting the network parameters of the second detection model according to the second training loss parameters to obtain an adjusted second detection model.
The second aspect of the present application provides an image processing method, comprising:
acquiring an image to be detected;
And detecting the image to be detected by adopting a preset image detection model to obtain a detection result, wherein the preset image detection model is the adjusted second detection model trained according to the model training method provided by the first aspect.
Accordingly, a third aspect of the present application provides a model training apparatus, comprising:
The acquisition unit is used for acquiring a first characteristic image obtained by processing an image sample through a first detection model and acquiring a second characteristic image obtained by processing the image sample through a second detection model;
The calculating unit is used for calculating a first training loss parameter corresponding to the position of the target object in the image sample based on the first characteristic diagram and the second characteristic diagram;
the first adjusting unit is used for adjusting the first training loss parameters according to the position of the guiding training area of the image sample to obtain second training loss parameters corresponding to the position of the target object;
And the second adjusting unit is used for adjusting the network parameters of the second detection model according to the second training loss parameters to obtain an adjusted second detection model.
In some embodiments, the first adjustment unit includes:
the first acquisition subunit is used for acquiring the labeling position of the foreground object in the image sample;
the first determining subunit is used for determining the position of the guide training area corresponding to the image sample according to the labeling position;
and the first adjusting subunit is used for adjusting the first training loss parameter according to the position of the guiding training area to obtain a second training loss parameter corresponding to the image sample.
In some embodiments, the apparatus further comprises:
The second acquisition subunit is used for acquiring a first detection position corresponding to the foreground object obtained by detecting the image sample by the first detection model;
The first determining subunit is further configured to:
And determining the position of the guide training area corresponding to the image sample according to the labeling position and the first detection position.
In some embodiments, the apparatus further comprises:
the third acquisition subunit is used for acquiring a second detection position of the foreground object obtained by detecting the image sample by the second detection model;
The first determining subunit is further configured to:
and determining the position of the guide training area corresponding to the image sample according to the marking position, the first detection position and the second detection position.
In some embodiments, the second acquisition subunit comprises:
the detection module is used for detecting the object of the first feature map to obtain a plurality of detection results, wherein the detection results comprise object information and the position of the object;
The labeling module is used for labeling rectangular frames corresponding to foreground objects in the first feature map according to the detection result and determining coordinate information of each rectangular frame;
and the determining module is used for determining the first detection position according to the coordinate information of each rectangular frame.
In some embodiments, the labeling module comprises:
the acquisition sub-module is used for acquiring the probability that each object belongs to a foreground object in the detection result;
The first determining submodule is used for determining that the object with the probability higher than the preset threshold value is a foreground object and labeling a corresponding rectangular frame of each foreground object;
and the second determination submodule is used for determining coordinate information of each rectangular frame.
In some embodiments, the computing unit comprises:
a second adjustment subunit, configured to adjust a size of the second feature map to be the same as a size of the first feature map, to obtain a third feature map;
And the calculating subunit is used for calculating the mean square error of the first feature map and the third feature map to obtain a first training loss parameter corresponding to the image sample.
In some embodiments, the second adjusting unit includes:
an updating subunit, configured to update a loss function of the second detection model according to the second training loss parameter, to obtain a target loss function;
The processing subunit is used for carrying out gradient descent processing on the target loss function until the target loss function converges;
And the second determination subunit is used for determining the second detection model when the target loss function converges as the adjusted second detection model.
A fourth aspect of the present application provides an image processing apparatus comprising:
the acquisition unit is used for acquiring the image to be detected;
The detection unit is used for detecting the image to be detected by adopting a preset image detection model to obtain a detection result, wherein the preset image detection model is the adjusted second detection model trained according to the model training method provided by the first aspect.
The fifth aspect of the present application also provides a computer readable storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps of the model training method provided in the first aspect of the present application or the image detection method provided in the second aspect.
A sixth aspect of the embodiments of the present application provides a computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the model training method provided in the first aspect or the image detection method provided in the second aspect when the computer program is executed.
A seventh aspect of the embodiments of the present application provides a computer program product or computer program comprising computer instructions stored in a storage medium. The computer instructions are read from a storage medium by a processor of a computer device, which executes the computer instructions, causing the computer device to perform the steps of the model training method provided in the first aspect or the image detection method provided in the second aspect.
According to the model training method provided by the embodiment of the application, a first feature map obtained by processing an image sample with the first detection model and a second feature map obtained by processing the image sample with the second detection model are acquired; a first training loss parameter corresponding to the positions of objects contained in the image sample is calculated based on the first feature map and the second feature map; the first training loss parameter is adjusted according to the guide training area position of the image sample to obtain a second training loss parameter corresponding to the image sample; and the network parameters of the second detection model are adjusted according to the second training loss parameter to obtain the adjusted second detection model. In this way, the second detection model learns from the first detection model at the guide training area position, so that its detection performance can be improved in a targeted manner and its detection accuracy further improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic view of a scenario of model training provided by the present application;
FIG. 2 is a flow chart of the model training method provided by the application;
FIG. 3 is another flow chart of the model training method provided by the present application;
FIG. 4 is a schematic diagram of the calculation of the imitation loss in the present application;
FIG. 5 is a schematic diagram of a model training apparatus provided by the present application;
Fig. 6 is a schematic structural diagram of a computer device provided by the present application.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
The embodiment of the invention provides a model training method, an image detection device, a storage medium and equipment. The model training method can be used in a model training apparatus. The model training apparatus may be integrated in a computer device, which may be a terminal or a server. The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery network (CDN) services, big data, and artificial intelligence platforms.
Referring to fig. 1, a schematic view of a model training scenario provided by the present application is shown. In order to obtain a model with good performance and small volume, a knowledge distillation method may be used so that a small model to be trained learns the processing procedure of a trained large model. Specifically, as shown in fig. 1, the first detection model is a large model that has already been trained, and the second detection model is a small model to be trained. The first detection model and the second detection model are respectively used to extract features from an image sample, obtaining a first feature map corresponding to the first detection model and a second feature map corresponding to the second detection model. A first training loss parameter corresponding to the image sample is then calculated from the first feature map and the second feature map, and the first training loss parameter is adjusted according to the guide training area position to obtain a second training loss parameter. The network parameters of the second detection model are adjusted with the second training loss parameter to obtain the adjusted second detection model, thereby completing training of the second detection model.
It should be noted that, the schematic view of the model training scenario shown in fig. 1 is only an example, and the model training scenario described in the embodiment of the present application is to more clearly illustrate the technical solution of the present application, and does not constitute a limitation to the technical solution provided by the present application. Those skilled in the art can know that with the evolution of the model training method and the appearance of new business scenes, the technical scheme provided by the application is also applicable to similar technical problems.
The following describes the above-described embodiments in detail.
Embodiments of the present application will be described in terms of a model training apparatus that may be integrated into a computer device. The computer device may be a terminal or a server.
As shown in fig. 2, a flow chart of a model training method provided by the application is shown, and the method comprises the following steps:
Step 101, obtaining a first feature map obtained by processing an image sample through a first detection model, and obtaining a second feature map obtained by processing an image sample through a second detection model.
The image sample may be any image sample in a training image sample set, and each image sample in the set is labeled with the positions of the foreground it contains, where the foreground is the objects to be detected. For example, an image sample used to train detection of animals in images may contain animals such as cats, dogs, cows, horses, and squirrels, as well as objects such as trees, grasslands, lakes, sky, and clouds. In this image sample, the cats, dogs, cows, horses, and squirrels are foreground, while the trees, grasslands, lakes, sky, and clouds are background. In the image sample, the positions of the animals, namely the cats, dogs, cows, horses, and squirrels, are marked. The position of a foreground object may be its specific position in the image sample, or the coordinates and size data of a rectangular frame containing it. Each foreground object has its corresponding position, and the positions of different foreground objects may or may not overlap.
The first detection model is a trained detection model with a large volume, so deploying it in an application requires more storage space and more computing units and therefore occupies more computing resources. The second detection model may be an untrained detection model or a preliminarily trained detection model with a smaller volume, so deploying it in an application occupies fewer computing resources. In the present application, the first detection model may also be called the teacher model and the second detection model the student model, and a knowledge distillation method is used to make the second detection model learn the model features of the first detection model so as to improve its detection capability.
Both the first detection model and the second detection model can be image detection models; image detection belongs to computer vision technology. Computer Vision (CV) is the science of studying how to make a machine "see"; more specifically, it uses cameras and computers in place of human eyes to recognize, track and measure targets, and further performs graphics processing so that the computer produces images more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric recognition techniques such as face recognition and fingerprint recognition. Specifically, the first detection model may be a large model of the YOLO series of target object detection models, and the second detection model may be a small model of the YOLO series. YOLO (You Only Look Once) is a target detection algorithm.
When the detection model is adopted to detect the image, the image is firstly subjected to feature extraction to obtain a feature map, and then a specific target object and the position of the object are further identified from the feature map. In the application, for any image sample in an image sample set, a first detection model is adopted to detect the image sample, a first feature image is firstly extracted from the image sample, and then a detection head is used to detect the first feature image to obtain a first detection result; and detecting the image sample by adopting a second detection model, extracting a second characteristic image from the image sample, and then detecting the second characteristic image by using a detection head to obtain a second detection result. The first detection result comprises the category of foreground objects and the position information of each foreground object, and the position information of each foreground object forms a first detection position corresponding to the foreground object; the second detection result also comprises a foreground object category and position information of each foreground object, and the position information of each foreground object forms a second detection position corresponding to the foreground object. The first detection result and the second detection result may also have a difference due to the difference in detection capability between the first detection model and the second detection model, that is, the foreground object detected in the first detection result may not necessarily be detected in the second detection result, and the position information of the same foreground object in the first detection result may not necessarily be the same as the position information in the second detection result.
Step 102, calculating a first training loss parameter corresponding to the object position contained in the image sample based on the first feature map and the second feature map.
The objects contained in the image sample in this calculation include both foreground objects, namely the target objects to be detected, and background objects. That is, the training loss parameter between the first feature map and the second feature map is calculated over the dimension of the global objects in the image sample. The first training loss parameter may be the absolute error (also called the L1 loss) between the feature parameters of the first feature map and the second feature map, or the mean square error (also called the L2 loss) between them. Since the derivative of the L1 loss at the zero point is not unique, convergence may be affected; therefore, to improve the efficiency of model training, the application uses the L2 loss between the first feature map and the second feature map as the first training loss parameter. The feature parameters of a feature map include its width, height and depth.
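By way of illustration, a minimal PyTorch-style sketch of the two candidate losses on a pair of feature maps; the tensor shapes and variable names are assumptions for illustration only, and in practice the second feature map must first be resized to match the first, as described below.

```python
import torch
import torch.nn.functional as F

# Illustrative feature maps of identical shape [C, H, W]; in practice the second
# feature map is first adapted to the size of the first one.
f_teacher = torch.randn(256, 52, 52)   # first feature map
f_student = torch.randn(256, 52, 52)   # second feature map (already adapted)

l1 = F.l1_loss(f_student, f_teacher)   # absolute error (L1 loss); derivative is not unique at 0
l2 = F.mse_loss(f_student, f_teacher)  # mean square error (L2 loss), used as the first training loss parameter
```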
After the first training loss parameter is obtained through calculation, gradient descent treatment can be performed on the first training loss parameter until the first training loss parameter converges, and the difference between the first characteristic diagram and the second characteristic diagram is continuously reduced, so that the process of learning the second detection model to the first detection model is realized, and the process is a knowledge distillation process. However, the first training loss parameter is calculated based on the dimension of the global object in the image sample, so that gradient descent processing is directly performed on the first training loss parameter, and the second detection model is learned from the first detection model in the global dimension. However, there is often more background than foreground in the image sample, so there is also often more background features than foreground features in the extracted feature map. If the first detection model is learned in the global dimension, more background features are learned, so that the improvement effect of the target detection capability is greatly reduced. Therefore, in the embodiment of the present application, further processing is required for the first training loss parameter.
In some embodiments, calculating a first training loss parameter corresponding to a position of an object contained in the image sample based on the first feature map and the second feature map includes:
1. The size of the second characteristic diagram is adjusted to be the same as that of the first characteristic diagram, and a third characteristic diagram is obtained;
2. And calculating the mean square error of the first feature map and the third feature map to obtain a first training loss parameter corresponding to the image sample.
In the embodiment of the application, because the first detection model and the second detection model are different models, the sizes of the first feature map and the second feature map obtained when the two models extract features from the same image sample are inconsistent, which affects the calculation of the first training loss parameter between the two feature maps. To address this, the present application adjusts the two feature maps to the same size before calculating the first training loss parameter between them. The size of the second feature map is adjusted to obtain a third feature map with the same size as the first feature map, and the L2 loss between the first feature map and the third feature map is then calculated to obtain the first training loss parameter.
Specifically, to resize the feature maps to the same size, a convolution operation may be introduced, using which the second feature map is resized to the same size as the first feature map. For example, if the first feature map has a size of [256, 52, 52] and the second feature map has a size of [128, 52, 52], the convolution is defined as conv = Conv2d(in_channels=128, out_channels=256, kernel_size=3, stride=1), where in_channels is the number of input channels, out_channels is the number of output channels, kernel_size is the convolution kernel size, and stride is the step size. This convolution is applied to the second feature map to obtain a third feature map of size [256, 52, 52], which is the same size as the first feature map.
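A minimal sketch of this adaptation step, assuming PyTorch; note that padding=1 is added here so that a 3×3 kernel with stride 1 preserves the 52×52 spatial size, a detail the example above leaves implicit.

```python
import torch
import torch.nn as nn

f_teacher = torch.randn(1, 256, 52, 52)   # first feature map, size [256, 52, 52]
f_student = torch.randn(1, 128, 52, 52)   # second feature map, size [128, 52, 52]

# Adapter convolution mapping the student's 128 channels to the teacher's 256 channels;
# padding=1 keeps the 52x52 spatial size with a 3x3 kernel and stride 1 (an assumption).
adapter = nn.Conv2d(in_channels=128, out_channels=256, kernel_size=3, stride=1, padding=1)

f_adapted = adapter(f_student)             # third feature map, size [1, 256, 52, 52]
assert f_adapted.shape == f_teacher.shape
```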
And step 103, adjusting the first training loss parameters according to the position of the guiding training area of the image sample to obtain second training loss parameters corresponding to the image sample.
The first training loss parameter calculated in step 102 is the L2 loss between the first feature map and the second feature map over the global dimension of the image sample, and optimizing this L2 loss only makes the second detection model approach the first detection model globally. Because the background usually occupies more of an image sample than the foreground, the second detection model would learn the features of mostly background areas, which weakens the improvement of its foreground detection capability. In the present application, the first training loss parameter is adjusted using the guide training area position of the image sample to obtain the L2 loss of the first feature map and the second feature map at the guide training area position, namely the second training loss parameter.
The guide training area position may be the positions of some or all of the foreground objects labeled in the image sample, and the labeled position of a foreground object may be its specific position in the image sample or the position of a rectangular frame containing it. In some embodiments, the guide training area position may further include the positions of foreground objects in the detection result obtained by the first detection model detecting the image sample. In other embodiments, the guide training area position may further include the positions of foreground objects in the detection result obtained by the second detection model detecting the image sample. The first training loss parameter is adjusted using the guide training area position to obtain the L2 loss of the first feature map and the second feature map at the guide training area position, namely the second training loss parameter. Optimizing this second training loss parameter then makes the second detection model approach the first detection model at the guide training area position, achieving the effect of learning from the first detection model there. Since the guide training area position contains the positions of the foreground objects in the image sample, excessive background information is prevented from being learned during knowledge distillation, and the learning attention is focused on the positions of the foreground objects, so that the detection capability of the second detection model for foreground objects can be improved in a targeted manner.
In some embodiments, adjusting the first training loss parameter according to the position of the guiding training area of the image sample to obtain a second training loss parameter corresponding to the image sample includes:
1. acquiring a labeling position of a foreground object in an image sample;
2. Determining the position of a guide training area corresponding to the image sample according to the labeling position;
3. And adjusting the first training loss parameters according to the positions of the guide training areas to obtain second training loss parameters corresponding to the image samples.
The foreground object is a target object to be detected in the image sample. For example, in an image of an animal, animals such as dogs, cats, cows, sheep, etc. are foreground objects, and grasslands, rivers, etc. in the image are background objects. For example, in a scene image in a first person shooting game, objects such as a person, an automobile, and an airplane are foreground objects, and houses, roads, and the like are background objects. For example, in an image in which a television station logo or watermark is detected, the television station logo and watermark are foreground, and the program content being played is background.
Since the image sample is a training image sample marked with the positions of foreground objects, the labeled position of a foreground object can be extracted from the image sample. The labeled position can also be understood as the actual position of the foreground object in the image sample. Labeling foreground objects in an image sample is usually done by marking rectangular frames containing them. In this method, the labeled position of a foreground object is obtained by dividing the image sample into pixel units, establishing a coordinate system in units of pixel blocks, and determining the vertex coordinates and side lengths of the rectangular frame of each labeled foreground object. The vertex coordinates and side lengths of the rectangular frame corresponding to each foreground object constitute the labeled position of that foreground object.
Further, the L2 loss of the first feature map and the second feature map calculated in step 102 is calculated over the global dimension of the image sample. Optimizing this L2 loss makes the student model approach the teacher model, but most of what the student model learns from the teacher model is background, which contributes little to improving the student model's target detection capability. Therefore, the student model needs to be focused on learning from the teacher model at particular positions; in the present application, a guide training area position can be set so that the student model learns from the teacher model at that position, improving the student model's detection capability for objects at the guide training area position in a targeted manner. Furthermore, the guide training area position can be determined according to the labeled positions of the foreground objects in the image sample, so that the student model learns from the teacher model in this area, which can specifically improve its ability to detect the foreground objects in the image sample.
In some embodiments, the model training method provided by the present application may further include:
1. Acquiring a first detection position corresponding to a foreground object obtained by detecting an image sample by a first detection model;
2. And determining the position of the guide training area corresponding to the image sample according to the labeling position and the first detection position.
The result of the teacher model detecting the image sample is the probability that each region in the image sample is a foreground object, and then the region with higher probability can be determined as the position corresponding to the foreground object, and the position can be recorded as the first detection position corresponding to the foreground object. For example, after the teacher model detects the first image sample, 5 detection results are obtained. Each detection result corresponds to a target object, each detection result comprises the probability that the target object is a foreground object, and each detection result further comprises the position information of each target object. If the probabilities of the 5 target objects being foreground objects are 0.1, 0.3, 0.9, 0.99, and 0.95, respectively, it may be determined whether the target objects are foreground objects according to a preset probability threshold (e.g., 0.8). As described above, three target objects corresponding to probability values of 0.9, 0.99 and 0.95 can be determined as foreground objects, and objects with probability values of 0.1 and 0.3 as background objects. Thus, the first detection position corresponding to the foreground object obtained by detecting the image sample by the first detection model is the position corresponding to the three target objects with probability values of 0.9, 0.99 and 0.95.
In the embodiment of the application, the guide training area position that guides the student model to learn from the teacher model is the union of the labeled position and the first detection position. The labeled position is the true position of a foreground object and is therefore a definite, hard output; the detection result obtained by the teacher model detecting the image sample is the probability that a certain area in the image is a foreground object, which is a soft output. The teacher model is a trained large model whose detection of foreground objects is accurate, but the detected position of a given foreground object may deviate slightly from its real position, which is also a consequence of the soft model output. Because both the first detection position detected by the teacher model and the real position of the foreground object are accurate descriptions of where the foreground object lies, the embodiment of the application determines the union of the labeled position and the first detection position as the guide training area position. This further enlarges the training area and guides the student model to learn from the teacher model in a larger area, thereby further improving the student model's detection capability.
In some embodiments, the model training method provided by the present application may further include:
1.1, acquiring a second detection position of a foreground object obtained by detecting an image sample by a second detection model;
And 1.2, determining the position of the guide training area corresponding to the image sample according to the marking position, the first detection position and the second detection position.
In the embodiment of the application, the guide training area position is determined by combining the labeled position, the first detection position, and the second detection position determined by the student model detecting the image sample. Since the student model is a small model with weaker performance, its detection results may contain false detections, for example background detected as foreground. The guide training area position in the embodiment of the application includes these falsely detected positions; the teacher model's detection result is more accurate and can clearly identify these positions as background. Thus, the L2 loss of the feature maps of the teacher model and the student model is calculated at these falsely detected positions, and continuously optimizing this L2 loss keeps the student model's feature map as consistent as possible with the teacher model's at these positions. In other words, the student model's detection results at these positions become consistent with the teacher model's, and the student model recognizes these positions as background.
On the other hand, the student model also has the problem of missed detection, namely a foreground object being wrongly treated as background. Since the guide training area position includes the labeled positions and the first detection positions, the feature maps of the student model and the teacher model can be kept consistent at these positions, i.e., the student model is enabled to recognize these positions as foreground.
Therefore, the guide training area is determined as the union of the labeled position, the first detection position and the second detection position, so that the student model and the teacher model remain consistent at the accurate foreground object positions and at the positions falsely detected by the student model. This avoids false detections by the student model, further improves its detection capability, and improves its detection accuracy.
And step 104, adjusting the network parameters of the second detection model according to the second training loss parameters to obtain an adjusted second detection model.
After determining the L2 loss of the first feature map and the second feature map at the guide training area position (i.e., the second training loss parameter, which may also be referred to as the imitation loss), the network parameters (or model parameters) of the second detection model need to be adjusted according to the imitation loss, so that the second detection model imitates the detection capability of the first detection model at the guide training area position.
In some embodiments, since the first feature map and the second feature map are obtained by the first detection model and the second detection model extracting features from the image sample, the L2 loss of the two feature maps at the guide training area position is the imitation loss corresponding to that image sample. Because a training image set containing a plurality of training images is generally prepared when training an image detection model, the number of image samples in the present application may be plural, and each image sample may determine its corresponding imitation loss according to the method of steps 101 to 103. To further improve the learning effect of the second detection model and avoid calculation deviations of the imitation loss caused by a single image sample, after the imitation loss corresponding to each image sample is determined, the imitation losses corresponding to all the image samples can be averaged to obtain the final target imitation loss of the second detection model. Finally, the network parameters of the second detection model are adjusted according to the averaged target imitation loss to obtain the adjusted second detection model.
In some embodiments, adjusting the network parameters of the second detection model according to the second training loss parameters to obtain an adjusted second detection model includes:
1. Updating the loss function of the second detection model according to the second training loss parameter to obtain a target loss function;
2. Gradient descent processing is performed on the target loss function until the target loss function converges;
3. and determining the second detection model when the target loss function converges as the adjusted second detection model.
The loss function of the second detection model is calculated when the second detection model is trained according to a training image set formed by a plurality of image samples. After the second training loss parameter is obtained through calculation, the loss function of the second detection model is updated by using the second training loss parameter, and an updated target loss function is obtained. The variable parameters in the target loss function comprise network parameters of the second detection model, gradient descent processing is carried out on the target loss function, the network parameters of the second detection model are determined to be final model parameters of the second detection model when the target loss function converges, and the adjusted second detection model is obtained, so that training of the second detection model is completed.
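A minimal training-loop sketch of this adjustment, assuming PyTorch; the names `detection_loss_fn`, `imitation_loss_fn`, the `backbone` attribute, the SGD optimizer, and the weighting factor `imitation_weight` are all illustrative assumptions, since the text only states that the loss function is updated with the second training loss parameter and optimized by gradient descent until convergence.

```python
import torch

def train_student(student, teacher, adapter, dataloader,
                  detection_loss_fn, imitation_loss_fn,
                  epochs=10, lr=1e-3, imitation_weight=1.0):
    """Adjust the student's network parameters with the target loss, i.e. the
    detection loss updated by the imitation loss (second training loss parameter)."""
    teacher.eval()                                       # first detection model is already trained
    optimizer = torch.optim.SGD(
        list(student.parameters()) + list(adapter.parameters()), lr=lr)

    for _ in range(epochs):
        for images, labels in dataloader:
            with torch.no_grad():
                f_teacher = teacher.backbone(images)     # first feature map
            f_student = student.backbone(images)         # second feature map
            f_adapted = adapter(f_student)               # resized to the teacher's size

            # Target loss: original detection loss plus the imitation loss,
            # averaged over the image samples in the batch.
            loss = detection_loss_fn(student(images), labels) \
                 + imitation_weight * imitation_loss_fn(f_teacher, f_adapted, labels)

            optimizer.zero_grad()
            loss.backward()                              # gradient descent on the target loss
            optimizer.step()
    return student
```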
According to the above description, the model training method provided by the embodiment of the application acquires a first feature map obtained by processing an image sample with the first detection model and a second feature map obtained by processing the image sample with the second detection model; calculates a first training loss parameter corresponding to the positions of objects contained in the image sample based on the first feature map and the second feature map; adjusts the first training loss parameter according to the guide training area position of the image sample to obtain a second training loss parameter corresponding to the image sample; and adjusts the network parameters of the second detection model according to the second training loss parameter to obtain the adjusted second detection model. In this way, the second detection model learns from the first detection model at the guide training area position, so that its detection performance can be improved in a targeted manner and its detection accuracy further improved.
Correspondingly, the embodiment of the application further describes the model training method provided by the application in detail from the perspective of computer equipment, wherein the computer equipment can be a terminal or a server. As shown in fig. 3, another flow chart of the model training method provided by the present application includes:
In step 201, a computer device obtains a first feature map obtained by processing an image sample with a first detection model and a second feature map obtained by processing an image sample with a second detection model.
The first detection model and the second detection model may be image detection models deployed in a computer device. In particular, the first detection model may be a large model in the YOLO series of image target detection models and the second detection model may be a small model in the YOLO series. The first detection model may be a mature, fully trained model whose accuracy in detecting objects in image samples is high but whose computation cost is large and detection efficiency is low. The second detection model may be an untrained model, and the knowledge distillation method is used to make the second detection model learn from the first detection model so as to improve its target detection capability. Thus, in the present application, the first detection model may also be referred to as the teacher model, and the second detection model as the student model.
The image samples may be training image samples labeled with the foreground objects and foreground object positions in the image. The teacher model performs feature extraction on an image sample to obtain a first feature map, which may be denoted f_t; the student model performs feature extraction on the same image sample to obtain a second feature map, which may be denoted f_s.
At step 202, the computer device calculates an L2 loss between the first feature map and the second feature map.
Since the teacher model and the student model are different models, the sizes of the first feature map and the second feature map obtained by the two models extracting features from the same image sample may differ. When the two feature maps do not match in size, the L2 loss between them cannot be calculated. Therefore, before calculating the L2 loss between the first feature map and the second feature map, the second feature map f_s may be resized so that its size is the same as that of the first feature map f_t. The feature map obtained after adjusting the second feature map f_s is denoted f_adpt. Calculating the L2 loss between the first feature map and the second feature map therefore means calculating the L2 loss between f_t and f_adpt, which is denoted as the first training loss parameter.
Here L2 loss = L2 = ||f_t − f_adpt||², i.e. the L2 loss between the first feature map and the adapted second feature map is the squared norm of the difference between f_t and f_adpt.
In step 203, the computer device obtains a first detection position corresponding to the foreground object obtained by detecting the image sample by the first detection model.
In the present application, the L2 loss of the first feature map and the second feature map calculated in step 202 is calculated over the global dimension of the image sample. Optimizing this global L2 loss is disturbed by the large number of background factors, which hinders the improvement of detection capability at foreground positions. Therefore, the embodiment of the application calculates the L2 loss between the first feature map and the second feature map at the guide training area position and optimizes it there, which can improve the student model's target detection capability at that position in a targeted manner. In the present application, the guide training area position may consist of the positions where the teacher model detects a high probability of a foreground object in the image sample, the positions where the student model detects a high probability of a foreground object, and the labeled positions of the foreground objects in the image sample. These three sets of positions therefore need to be obtained respectively.
In some embodiments, obtaining a first detection position corresponding to a foreground object obtained by detecting an image sample by using a first detection model includes:
A. Object detection is carried out on the first feature map, so that a plurality of detection results are obtained, wherein the detection results comprise object information and the position of an object;
B. Labeling rectangular frames corresponding to foreground objects in the first feature map according to the detection result and determining coordinate information of each rectangular frame;
C. and determining a first detection position according to the coordinate information of each rectangular frame.
The method comprises the steps that a teacher model detects an image sample, the position of a foreground object in the image sample is determined, and specifically, the teacher model performs feature extraction on the image sample to obtain a first feature map; and then detecting the feature map by adopting a detection head of the model to obtain a detection result. Similarly, the student model can also adopt the same method to obtain the detection result corresponding to the student model. The detection result contains the object in the characteristic diagram and the position of each object. And then extracting a detection result corresponding to the foreground object in the detection result, marking rectangular frames containing the position of the foreground object, and determining coordinate information of each rectangular frame. The coordinate information may be coordinate information of a rectangular frame determined by establishing a coordinate system with any vertex of the feature map as a coordinate origin. Further, the width value and the height value of the rectangular frame may be determined, and the first detection position is a position determined by the coordinate information of the rectangular frame and the width value and the height value of the rectangular frame. When a plurality of foreground objects exist, the corresponding rectangular frames are also a plurality of, and the position set determined by the coordinate information of the rectangular frames forms a first detection position.
Specifically, the teacher model performs feature extraction on the image sample to obtain a first feature map f_t of size W×H×C, where W is the width of the feature map, H is the height of the feature map, and C is the depth of the feature map. The feature map f_t is passed through the detection head of the model to obtain a detection result, and the size of the detection result (model_out) is W×H×(anchors_n×(classes_n+1+4)). Here anchors_n is the number of anchor boxes on the corresponding feature map, and the anchor boxes are the rectangular frames described above. classes_n is the number of categories, i.e., the probability of each category, such as the probability that the area is a cat, or the probability that the area is a dog. The dimension "1" represents the probability p that the target is foreground; the dimension "4" represents the position of the rectangular box (x, y, w, h), where (x, y) is the coordinate of a given vertex of the rectangular box, w is its width, and h is its height. For example, when anchors_n is 3, that is, three rectangular boxes are determined in the first feature map f_t, classes_n gives the probability that the object in each rectangular box is a particular class of object, and the probabilities that the objects corresponding to the three rectangular boxes are foreground are p_1, p_2 and p_3, respectively. The positions of the three rectangular boxes are (x_1, y_1, w_1, h_1), (x_2, y_2, w_2, h_2) and (x_3, y_3, w_3, h_3), respectively.
In some embodiments, labeling rectangular frames corresponding to foreground objects in the first feature map according to detection results and determining coordinate information of each rectangular frame includes:
a. obtaining the probability that each object belongs to a foreground object in the detection result;
b. Determining an object with probability higher than a preset threshold as a foreground object, and labeling a corresponding rectangular frame of each foreground object;
c. Coordinate information of each rectangular frame is determined.
The probability that each object belongs to the foreground object in the detection result is p corresponding to the rectangular frame where the object is located, as described above. A confidence value may be set, and when p is higher than the confidence value, the object is determined to be a foreground object; when p is not higher than the confidence value, the object is determined to be a background object. Then, the rectangular frames of the foreground object can be determined for marking, and the coordinate information and the size of each rectangular frame are determined, namely the position of each rectangular frame is determined. The union of the positions of the rectangular frames is the first detection position obtained by detecting the image sample by the teacher model.
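A minimal sketch of this thresholding step; the record format (p, x, y, w, h) is assumed to have already been decoded from the detection output, and the confidence value 0.8 follows the numerical example given earlier.

```python
def foreground_boxes(detections, confidence=0.8):
    """Keep rectangular boxes whose foreground probability p exceeds the confidence value.

    `detections` is assumed to be an iterable of (p, x, y, w, h) records decoded
    from the W x H x (anchors_n x (classes_n + 1 + 4)) detection output.
    """
    boxes = []
    for p, x, y, w, h in detections:
        if p > confidence:                # object judged to be a foreground object
            boxes.append((x, y, w, h))    # position of its labeled rectangular box
    return boxes
```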
In step 204, the computer device obtains a second detection position corresponding to the foreground object obtained by detecting the image sample by the second detection model.
After the first detection position of the foreground object obtained by the teacher model for detecting the image sample is obtained, the second detection position of the foreground object obtained by the student model for detecting the image sample is further obtained. The second detection position corresponding to the foreground object detected by the student model may be obtained by using the same method for detecting and determining the first detection position of the foreground object in step 203.
In some embodiments, the rectangular box corresponding to each foreground object may be further enlarged, for example by a factor of 1.2, and the union formed by the positions of the enlarged rectangular boxes may be taken as the first detection position. Because a rectangular frame covers only part of the picture, enlarging it enlarges the learning area and thereby further strengthens the detection capability of the student model.
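A small sketch of such an enlargement follows; the patent does not fix the reference point for scaling, so enlarging about the box centre is an assumption made here for illustration.

```python
# Hedged sketch: enlarge a rectangular frame about its centre by a factor of 1.2.
def enlarge_box(x, y, w, h, scale=1.2):
    cx, cy = x + w / 2.0, y + h / 2.0     # centre of the original frame
    new_w, new_h = w * scale, h * scale   # enlarged width and height
    return cx - new_w / 2.0, cy - new_h / 2.0, new_w, new_h

print(enlarge_box(10, 10, 4, 6))          # (9.6, 9.4, 4.8, 7.2)
```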
In step 205, the computer device determines a position of the guide training area according to the first detection position, the second detection position and the labeling position of the foreground object in the image sample.
In the present application, the position of the guide training area is the union of the first detection position, the second detection position, and the labeling position of the foreground object in the image sample. The guide training area therefore contains not only the positions where the teacher model detects a high foreground probability and the positions where the student model detects a high foreground probability, but also the labeling position of the foreground object in the image sample, i.e., the real position of the foreground object. A construction of this union as a feature-map mask is sketched below.
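The sketch below rasterises boxes onto the W×H feature-map grid and takes the element-wise union of the teacher, student and ground-truth masks; the coordinate convention ((x, y) as the top-left vertex) and all names are assumptions for illustration.

```python
import numpy as np

# Hedged sketch: build the guide training area M as the union of three box masks.
def boxes_to_mask(boxes, W, H):
    M = np.zeros((W, H), dtype=np.float32)
    for x, y, w, h in boxes:
        x0, y0 = max(int(x), 0), max(int(y), 0)
        x1, y1 = min(int(np.ceil(x + w)), W), min(int(np.ceil(y + h)), H)
        M[x0:x1, y0:y1] = 1.0             # mark cells covered by the rectangular frame
    return M

teacher_boxes = [(1, 1, 4, 4)]            # first detection position
student_boxes = [(2, 2, 4, 4)]            # second detection position
gt_boxes      = [(6, 6, 3, 3)]            # labeling position of the foreground object
M = np.maximum.reduce([boxes_to_mask(b, 13, 13)
                       for b in (teacher_boxes, student_boxes, gt_boxes)])
print(int(M.sum()))                       # number of cells inside the guide training area
```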
At step 206, the computer device calculates the L2 loss of the first feature map and the second feature map at the position of the guide training area, obtaining the imitation loss.
In order to train the student model in a targeted manner, improve its detection capability at foreground positions, and thereby further improve its target detection capability, the student model needs to learn from the teacher model at the position of the guide training area. Specifically, the L2 loss of the first feature map and the second feature map at the position of the guide training area may be calculated and taken as the imitation loss. Optimizing the imitation loss drives the first feature map and the second feature map toward consistency at the position of the guide training area, so that the student model learns from the teacher model in that area.
Specifically, the L2 loss of the first feature map and the second feature map at the position of the guide training area may be calculated according to the L2 loss of the first feature map and the second feature map and the position of the guide training area. The specific calculation formula is shown as the following formula (1):
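The expression for formula (1) is not reproduced in this text (it appears only as an image in the original publication); a plausible reconstruction, consistent with the variable definitions below and with the commonly used fine-grained feature-imitation loss, is:

$$L_{imitation}=\frac{1}{2N_{M}}\sum_{i=1}^{W}\sum_{j=1}^{H}\sum_{c=1}^{C} M_{i,j}\,\bigl(f^{adpt}_{i,j,c}-f^{t}_{i,j,c}\bigr)^{2} \qquad (1)$$

where f^t is the first feature map and f^adpt is the student feature map after the size adaptation described below; these feature terms are part of the reconstruction rather than quoted from the original.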
Where L_imitation is the imitation loss, W is the width of the feature map, H is the height of the feature map, and C is the depth of the feature map; that is, the feature map can be divided into W unit sizes along its width and H unit sizes along its height. M is the guide training area mask, and M_{i,j} is its value at width index i and height index j. N_M is a summation term, specifically shown in the following formula (2):
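Formula (2) is likewise only described in words; the natural reading, with N_M counting the feature-map cells covered by the guide training area, is:

$$N_{M}=\sum_{i=1}^{W}\sum_{j=1}^{H} M_{i,j} \qquad (2)$$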
In this way, the L2 loss of the first feature map and the second feature map at the position of the guide training area, i.e., the imitation loss, is calculated.
The calculation of the imitation loss between the student model and the teacher model is further described with reference to the drawings. Fig. 4 is a schematic diagram of the imitation-loss calculation process in the present application. The image sample undergoes feature extraction by the teacher model to obtain a first feature map f_t, on which target detection is then performed to obtain the foreground objects in the image sample and their positions. In the figure, the circles with oblique-line shading are the foreground objects detected by the teacher model, and their positions in the image sample are the first detection positions. Similarly, the image sample undergoes feature extraction by the student model to obtain a second feature map f_s, on which target detection is performed to obtain the foreground objects and their positions; the unshaded circles in the figure are the foreground objects detected by the student model, and their positions are the second detection positions. As the figure shows, because the detection capabilities of the student model and the teacher model differ, the numbers of foreground objects they detect differ (the student model has missed detections), and the position of each foreground object also differs.
The imitation loss is the L2 loss of the teacher model's first feature map f_t and the student model's second feature map f_s at the position of the guide training area. First, the L2 loss of f_t and f_s over the whole image sample is computed. However, because the student model and the teacher model differ, the sizes of f_t and f_s also differ, so the size of f_s must first be adjusted to obtain a third feature map f_adpt with the same size as f_t. The L2 loss is then calculated from f_t and f_adpt, giving the L2 loss of the first and second feature maps over the whole image sample, i.e., the first training loss parameter. Next, the position of the guide training area is determined as the union of the first detection position, the second detection position and the labeling position; in the figure, the circles with vertical-line shading are the foreground objects labeled in the image sample, and the guide training area position is the set of positions of all circles in the image sample. Finally, the imitation loss of the student model with respect to the teacher model is computed from the first training loss parameter and the position of the guide training area.
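A hedged sketch of this computation follows; the use of PyTorch, the upsample-plus-1×1-convolution adaptation layer, and all shapes and names are illustrative assumptions rather than the patent's code.

```python
import torch
import torch.nn as nn

# Hedged sketch of the imitation-loss computation on the guide training area.
def imitation_loss(f_t, f_adpt, M):
    """f_t, f_adpt: (C, W, H) teacher and size-adapted student feature maps;
    M: (W, H) binary guide-training-area mask."""
    n_m = M.sum().clamp(min=1.0)                           # formula (2): cells inside the area
    sq_err = (f_adpt - f_t) ** 2                           # element-wise squared (L2) error
    return (sq_err * M.unsqueeze(0)).sum() / (2.0 * n_m)   # formula (1), as reconstructed above

# Toy usage: upsampling + 1x1 convolution as the size adaptation producing f_adpt.
C_t, C_s = 8, 4
f_t = torch.randn(C_t, 13, 13)                             # first feature map
f_s = torch.randn(C_s, 7, 7)                               # second feature map (different size)
adapt = nn.Sequential(nn.Upsample(size=(13, 13), mode="bilinear", align_corners=False),
                      nn.Conv2d(C_s, C_t, kernel_size=1))
f_adpt = adapt(f_s.unsqueeze(0)).squeeze(0)                # third feature map, same size as f_t
M = torch.zeros(13, 13)
M[2:6, 2:6] = 1.0                                          # guide training area mask
print(imitation_loss(f_t, f_adpt, M))
```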
In step 207, the computer device adjusts the model parameters of the second detection model according to the imitation loss to obtain an adjusted second detection model.
After the imitation loss of the first feature map and the second feature map at the position of the guide training area is calculated, it can be used to train the second detection model (i.e., the student model), so that the student model learns from the teacher model at the position of the guide training area and its feature map becomes consistent with the teacher's feature map in that area, thereby improving the student model's ability to detect foreground objects.
In particular, the original loss function of the student model may be adjusted using the imitation loss, as shown in the following formula (3):
L = L_gt + λ · L_imitation   (3)
where L is the updated loss function of the student model; L_gt is the original loss function of the student model, i.e., the loss function used when training the student model normally with training images, which includes a regression loss and a classification loss; λ is a hyperparameter that can be set according to the user's needs; and L_imitation is the imitation loss.
Gradient descent is then performed on the updated loss function L of the student model until L converges. The model parameters of the student model at the point where L converges are taken as its final model parameters, yielding the trained student model.
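A minimal, self-contained sketch of formula (3) and the gradient-descent update is given below; the one-layer `student`, the toy data and the placeholder loss terms are assumptions for illustration, not the patent's detection network or its actual losses.

```python
import torch
import torch.nn as nn

student = nn.Linear(10, 2)                            # stand-in for the student detection model
optimizer = torch.optim.SGD(student.parameters(), lr=0.01)
lam = 0.01                                            # hyperparameter lambda, set as needed

x, y = torch.randn(32, 10), torch.randn(32, 2)
for step in range(200):                               # in practice: iterate until L converges
    optimizer.zero_grad()
    out = student(x)
    l_gt = nn.functional.mse_loss(out, y)             # original loss (regression + classification in the patent)
    l_imitation = out.var()                           # placeholder for the imitation loss of step 206
    loss = l_gt + lam * l_imitation                   # formula (3): L = L_gt + lambda * L_imitation
    loss.backward()                                   # gradient-descent processing of L
    optimizer.step()
```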
As described above, in the model training method provided by the embodiment of the application, a first feature map obtained by processing an image sample through a first detection model is obtained, and a second feature map obtained by processing the image sample through a second detection model is obtained; a first training loss parameter corresponding to the object positions contained in the image sample is calculated based on the first feature map and the second feature map; the first training loss parameter is adjusted according to the position of the guide training area of the image sample to obtain a second training loss parameter corresponding to the image sample; and the network parameters of the second detection model are adjusted according to the second training loss parameter to obtain an adjusted second detection model. In this way, the second detection model learns from the first detection model at the position of the guide training area, so the detection performance of the second detection model can be improved in a targeted manner, and its detection accuracy is further improved.
The model training method provided by the application can be used to optimize a detection model so that a small model achieves detection capability close to that of a large model. In particular, the detection model may be an image detection model. Accordingly, the application also provides an image detection method, which includes the following steps:
acquiring an image to be detected;
Detecting an image to be detected by adopting a preset image detection model to obtain a detection result; the preset image detection model is an adjusted second detection model obtained by training according to any model training method provided by the application.
The image to be detected may be stored on a blockchain and obtained from the blockchain nodes; after detection is completed, the detection result corresponding to the image to be detected is returned to the corresponding blockchain node for storage.
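A hedged sketch of this image detection method is given below: load the adjusted second detection model (the trained student) and run it on an image to be detected. The file name, input size and preprocessing are assumptions for illustration only.

```python
import torch

student = torch.load("student_detector.pt", map_location="cpu")  # preset image detection model
student.eval()

image = torch.rand(1, 3, 416, 416)       # image to be detected, already preprocessed
with torch.no_grad():
    detection_result = student(image)    # detection result, e.g. returned to a blockchain node
```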
In order to better implement the above method, the embodiment of the present invention further provides a model training apparatus, which may be integrated in a server.
For example, as shown in fig. 5, a schematic structural diagram of a model training apparatus according to an embodiment of the present application may include an obtaining unit 301, a calculating unit 302, a first adjusting unit 303, and a second adjusting unit 304, as follows:
An obtaining unit 301, configured to obtain a first feature map obtained by processing an image sample with a first detection model, and obtain a second feature map obtained by processing an image sample with a second detection model;
a calculating unit 302, configured to calculate a first training loss parameter corresponding to a target object position in the image sample based on the first feature map and the second feature map;
the first adjusting unit 303 is configured to adjust the first training loss parameter according to the position of the guiding training area of the image sample, so as to obtain a second training loss parameter corresponding to the position of the target object;
The second adjusting unit 304 is configured to adjust the network parameters of the second detection model according to the second training loss parameters, so as to obtain an adjusted second detection model.
In some embodiments, the first adjustment unit comprises:
the first acquisition subunit is used for acquiring the labeling position of the foreground object in the image sample;
The first determining subunit is used for determining the position of the guide training area corresponding to the image sample according to the labeling position;
and the first adjusting subunit is used for adjusting the first training loss parameters according to the position of the guide training area to obtain second training loss parameters corresponding to the image samples.
In some embodiments, the apparatus further comprises:
The second acquisition subunit is used for acquiring a first detection position corresponding to the foreground object obtained by detecting the image sample by the first detection model;
the first determining subunit is further configured to:
and determining the position of the guide training area corresponding to the image sample according to the labeling position and the first detection position.
In some embodiments, the apparatus further comprises:
the third acquisition subunit is used for acquiring a second detection position of the foreground object obtained by detecting the image sample by the second detection model;
the first determining subunit is further configured to:
and determining the position of the guide training area corresponding to the image sample according to the labeling position, the first detection position and the second detection position.
In some embodiments, the second acquisition subunit comprises:
The detection module is used for detecting the object of the first feature map to obtain a plurality of detection results, wherein the detection results comprise object information and the position of the object;
The labeling module is used for labeling rectangular frames corresponding to foreground objects in the first feature map according to the detection result and determining coordinate information of each rectangular frame;
and the determining module is used for determining the first detection position according to the coordinate information of each rectangular frame.
In some embodiments, the labeling module comprises:
The acquisition sub-module is used for acquiring the probability that each object belongs to a foreground object in the detection result;
The first determining submodule is used for determining that the object with the probability higher than the preset threshold value is a foreground object and labeling a corresponding rectangular frame of each foreground object;
and the second determination submodule is used for determining coordinate information of each rectangular frame.
In some embodiments, the computing unit comprises:
the second adjusting subunit is used for adjusting the size of the second characteristic diagram to be the same as the size of the first characteristic diagram to obtain a third characteristic diagram;
And the calculating subunit is used for calculating the mean square error of the first feature map and the third feature map to obtain a first training loss parameter corresponding to the image sample.
In some embodiments, the second adjustment unit comprises:
The updating subunit is used for updating the loss function of the second detection model according to the second training loss parameter to obtain a target loss function;
The processing subunit is used for carrying out gradient descent processing on the target loss function until the target loss function converges;
And the second determination subunit is used for determining the second detection model when the target loss function converges as the adjusted second detection model.
In implementation, each of the above units may be implemented as an independent entity, or any combination of them may be implemented as one or several entities; for the specific implementation of each unit, reference may be made to the foregoing method embodiments, which are not repeated here.
As can be seen from the above, in the model training device provided by the embodiment of the present application, the acquiring unit 301 acquires a first feature map obtained by processing an image sample with the first detection model and a second feature map obtained by processing the image sample with the second detection model; the calculation unit 302 calculates a first training loss parameter corresponding to the object positions contained in the image sample based on the first feature map and the second feature map; the first adjusting unit 303 adjusts the first training loss parameter according to the position of the guide training area of the image sample to obtain a second training loss parameter corresponding to the image sample; and the second adjusting unit 304 adjusts the network parameters of the second detection model according to the second training loss parameter to obtain an adjusted second detection model. In this way, the second detection model learns from the first detection model at the position of the guide training area, so the detection performance of the second detection model can be improved in a targeted manner, and its detection accuracy is further improved.
Correspondingly, in order to better implement the image detection method, the embodiment of the invention also provides an image detection device, which specifically comprises:
the acquisition unit is used for acquiring the image to be detected;
The detection unit is used for detecting the image to be detected by adopting a preset image detection model to obtain a detection result; the preset image detection model is an adjusted second detection model which is obtained by training according to any model training method.
The specific functions of each module are consistent with the steps in the image detection method, and are not described herein.
The embodiment of the application also provides a computer device, which may be a terminal or a server. The server may be any node on the blockchain. Fig. 6 is a schematic structural diagram of the computer device according to the present application. Specifically:
The computer device may include a processor 401 with one or more processing cores, a memory 402 with one or more computer-readable storage media, a power supply 403, an input unit 404, and other components. Those skilled in the art will appreciate that the computer device structure shown in FIG. 6 does not limit the computer device; it may include more or fewer components than shown, combine certain components, or arrange the components differently. Wherein:
the processor 401 is a control center of the computer device, connects various parts of the entire computer device using various interfaces and lines, performs various functions of the computer device and processes data by running or executing software programs and/or modules stored in the memory 402, and calling data stored in the memory 402. Optionally, processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor and a modem processor, wherein the application processor mainly processes an operating system, a user interface, an application program, etc., and the modem processor mainly processes wireless communication. It will be appreciated that the modem processor described above may not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and model training by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, application programs required for at least one function (such as a sound playing function, an image playing function, a web page access, etc.), and the like; the storage data area may store data created according to the use of the computer device, etc. In addition, memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.
The computer device further comprises a power supply 403 for supplying power to the various components, preferably the power supply 403 may be logically connected to the processor 401 by a power management system, so that functions of charge, discharge, and power consumption management may be performed by the power management system. The power supply 403 may also include one or more of any of a direct current or alternating current power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.
The computer device may also include an input unit 404, which input unit 404 may be used to receive input numeric or character information and to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
Although not shown, the computer device may further include a display unit or the like, which is not described herein. In particular, in this embodiment, the processor 401 in the computer device loads executable files corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and the processor 401 executes the application programs stored in the memory 402, so as to implement various functions as follows:
Acquiring a first feature map obtained by processing an image sample through a first detection model, and acquiring a second feature map obtained by processing an image sample through a second detection model; calculating a first training loss parameter corresponding to the object position contained in the image sample based on the first feature map and the second feature map; adjusting the first training loss parameters according to the position of the guiding training area of the image sample to obtain second training loss parameters corresponding to the image sample; and adjusting the network parameters of the second detection model according to the second training loss parameters to obtain an adjusted second detection model.
Or: acquiring an image to be detected;
Detecting an image to be detected by adopting a preset image detection model to obtain a detection result; the preset image detection model is an adjusted second detection model obtained by training according to any model training method provided by the application.
It should be noted that, the computer device provided in the embodiment of the present application and the model training method in the above embodiment belong to the same concept, and the specific implementation of each operation above may refer to the foregoing embodiment, which is not described herein.
Those of ordinary skill in the art will appreciate that all or a portion of the steps of the various methods of the above embodiments may be performed by instructions, or by instructions controlling associated hardware, which may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, embodiments of the present invention provide a computer readable storage medium having stored therein a plurality of instructions capable of being loaded by a processor to perform the steps of any of the model training methods provided by embodiments of the present invention. For example, the instructions may perform the steps of:
Acquiring a first feature map obtained by processing an image sample through a first detection model, and acquiring a second feature map obtained by processing an image sample through a second detection model; calculating a first training loss parameter corresponding to the object position contained in the image sample based on the first feature map and the second feature map; adjusting the first training loss parameters according to the position of the guiding training area of the image sample to obtain second training loss parameters corresponding to the image sample; and adjusting the network parameters of the second detection model according to the second training loss parameters to obtain an adjusted second detection model.
Or: acquiring an image to be detected;
Detecting an image to be detected by adopting a preset image detection model to obtain a detection result; the preset image detection model is an adjusted second detection model obtained by training according to any model training method provided by the application.
The specific implementation of each operation above may be referred to the previous embodiments, and will not be described herein.
The storage medium may include: a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or the like.
The instructions stored in the storage medium can execute the steps in any model training method provided by the embodiment of the present invention, so that the beneficial effects that any model training method provided by the embodiment of the present invention can achieve can be achieved, see the previous embodiments in detail, and are not repeated here.
Wherein according to an aspect of the application, a computer program product or a computer program is provided, the computer program product or computer program comprising computer instructions stored in a storage medium. The processor of the computer device reads the computer instructions from the storage medium, and the processor executes the computer instructions, so that the computer device performs the steps of the model training method or the image detection method provided in the above-described various alternative implementations.
The foregoing describes in detail a model training method, an image detection method, an apparatus, a storage medium and a device provided by the embodiments of the present invention, and specific examples are applied to illustrate the principles and embodiments of the present invention, where the foregoing examples are only used to help understand the method and core idea of the present invention; meanwhile, as those skilled in the art will vary in the specific embodiments and application scope according to the ideas of the present invention, the present description should not be construed as limiting the present invention in summary.

Claims (15)

1. A method of model training, the method comprising:
Acquiring a first feature map obtained by processing an image sample through a first detection model, and acquiring a second feature map obtained by processing the image sample through a second detection model;
Calculating a first training loss parameter corresponding to the object position contained in the image sample based on the first feature map and the second feature map;
The first training loss parameters are adjusted according to the position of a guiding training area of the image sample, so that second training loss parameters corresponding to the image sample are obtained, and the position of the guiding training area comprises the position of a foreground object in the image sample;
and adjusting the network parameters of the second detection model according to the second training loss parameters to obtain an adjusted second detection model.
2. The method according to claim 1, wherein the adjusting the first training loss parameter according to the position of the guiding training area of the image sample to obtain the second training loss parameter corresponding to the image sample includes:
acquiring a labeling position of a foreground object in the image sample;
Determining the position of a guide training area corresponding to the image sample according to the labeling position;
And adjusting the first training loss parameters according to the position of the guide training area to obtain second training loss parameters corresponding to the image samples.
3. The method according to claim 2, wherein the method further comprises:
Acquiring a first detection position corresponding to a foreground object obtained by detecting the image sample by the first detection model;
the determining the position of the guiding training area corresponding to the image sample according to the labeling position comprises the following steps:
And determining the position of the guide training area corresponding to the image sample according to the labeling position and the first detection position.
4. A method according to claim 3, characterized in that the method further comprises:
Acquiring a second detection position of a foreground object obtained by detecting the image sample by the second detection model;
The determining the position of the guiding training area corresponding to the image sample according to the labeling position and the first detection position comprises the following steps:
and determining the position of the guide training area corresponding to the image sample according to the marking position, the first detection position and the second detection position.
5. A method according to claim 3, wherein the obtaining a first detection position corresponding to a foreground object obtained by detecting the image sample by the first detection model includes:
performing object detection on the first feature map to obtain a plurality of detection results, wherein the detection results comprise object information and the position of an object;
Labeling rectangular frames corresponding to foreground objects in the first feature map according to the detection result and determining coordinate information of each rectangular frame;
and determining a first detection position according to the coordinate information of each rectangular frame.
6. The method according to claim 5, wherein labeling rectangular boxes corresponding to foreground objects in the first feature map and determining coordinate information of each rectangular box according to the detection result comprises:
Obtaining the probability that each object belongs to a foreground object in the detection result;
Determining an object with probability higher than a preset threshold as a foreground object, and labeling a corresponding rectangular frame of each foreground object;
Coordinate information of each rectangular frame is determined.
7. The method of claim 1, wherein the calculating a first training loss parameter corresponding to the object position contained in the image sample based on the first feature map and the second feature map comprises:
The size of the second feature map is adjusted to be the same as that of the first feature map, and a third feature map is obtained;
and calculating the mean square error of the first feature map and the third feature map to obtain a first training loss parameter corresponding to the image sample.
8. The method of claim 1, wherein adjusting the network parameters of the second detection model according to the second training loss parameters results in an adjusted second detection model, comprising:
updating the loss function of the second detection model according to the second training loss parameter to obtain a target loss function;
Gradient descent processing is carried out on the target loss function until the target loss function converges;
And determining the second detection model when the target loss function converges as an adjusted second detection model.
9. An image detection method, the method comprising:
acquiring an image to be detected;
detecting the image to be detected by using a preset image detection model to obtain a detection result, wherein the preset image detection model is the adjusted second detection model trained according to the model training method of any one of claims 1 to 7.
10. A model training apparatus, the apparatus comprising:
The acquisition unit is used for acquiring a first characteristic image obtained by processing an image sample through a first detection model and acquiring a second characteristic image obtained by processing the image sample through a second detection model;
The calculating unit is used for calculating a first training loss parameter corresponding to the position of the target object in the image sample based on the first characteristic diagram and the second characteristic diagram;
The first adjusting unit is used for adjusting the first training loss parameters according to the position of a guiding training area of the image sample to obtain second training loss parameters corresponding to the position of the target object, wherein the position of the guiding training area comprises the position of the foreground object in the image sample;
And the second adjusting unit is used for adjusting the network parameters of the second detection model according to the second training loss parameters to obtain an adjusted second detection model.
11. The apparatus of claim 10, wherein the first adjustment unit comprises:
The acquisition subunit is used for acquiring the labeling position of the foreground object in the image sample;
The determining subunit is used for determining the position of the guide training area corresponding to the image sample according to the labeling position;
And the adjustment subunit is used for adjusting the first training loss parameter according to the position of the guiding training area to obtain a second training loss parameter corresponding to the image sample.
12. An image processing apparatus, comprising:
the acquisition unit is used for acquiring the image to be detected;
the detection unit is configured to detect the image to be detected by using a preset image detection model to obtain a detection result, where the preset image detection model is the adjusted second detection model obtained by training according to the model training method of any one of claims 1 to 7.
13. A computer readable storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps of the model training method of any one of claims 1 to 8 or the image detection method of claim 9.
14. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the model training method of any one of claims 1 to 8 or the image detection method of claim 9 when the computer program is executed.
15. A computer program product, characterized in that it comprises computer instructions stored in a storage medium, from which a processor of a computer device reads, which processor executes the computer instructions, so that the computer device performs the steps of the model training method of any one of claims 1 to 8 or the image detection method of claim 9.
CN202110523328.1A 2021-05-13 2021-05-13 Model training method, image detection device, storage medium and equipment Active CN113762051B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110523328.1A CN113762051B (en) 2021-05-13 2021-05-13 Model training method, image detection device, storage medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110523328.1A CN113762051B (en) 2021-05-13 2021-05-13 Model training method, image detection device, storage medium and equipment

Publications (2)

Publication Number Publication Date
CN113762051A CN113762051A (en) 2021-12-07
CN113762051B true CN113762051B (en) 2024-05-28

Family

ID=78787079

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110523328.1A Active CN113762051B (en) 2021-05-13 2021-05-13 Model training method, image detection device, storage medium and equipment

Country Status (1)

Country Link
CN (1) CN113762051B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI817896B (en) * 2022-02-16 2023-10-01 鴻海精密工業股份有限公司 Machine learning method and device
CN115019060A (en) * 2022-07-12 2022-09-06 北京百度网讯科技有限公司 Target recognition method, and training method and device of target recognition model

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111639744A (en) * 2020-04-15 2020-09-08 北京迈格威科技有限公司 Student model training method and device and electronic equipment
CN111709497A (en) * 2020-08-20 2020-09-25 腾讯科技(深圳)有限公司 Information processing method and device and computer readable storage medium
CN111950633A (en) * 2020-08-12 2020-11-17 深圳市商汤科技有限公司 Neural network training method, neural network target detection method, neural network training device, neural network target detection device and storage medium
CN111967597A (en) * 2020-08-18 2020-11-20 上海商汤临港智能科技有限公司 Neural network training and image classification method, device, storage medium and equipment
CN112257815A (en) * 2020-12-03 2021-01-22 北京沃东天骏信息技术有限公司 Model generation method, target detection method, device, electronic device, and medium
CN112418268A (en) * 2020-10-22 2021-02-26 北京迈格威科技有限公司 Target detection method and device and electronic equipment
CN112464760A (en) * 2020-11-16 2021-03-09 北京明略软件系统有限公司 Training method and device for target recognition model
KR102232138B1 (en) * 2020-11-17 2021-03-25 (주)에이아이매틱스 Neural architecture search method based on knowledge distillation
CN112766244A (en) * 2021-04-07 2021-05-07 腾讯科技(深圳)有限公司 Target object detection method and device, computer equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180268292A1 (en) * 2017-03-17 2018-09-20 Nec Laboratories America, Inc. Learning efficient object detection models with knowledge distillation

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111639744A (en) * 2020-04-15 2020-09-08 北京迈格威科技有限公司 Student model training method and device and electronic equipment
CN111950633A (en) * 2020-08-12 2020-11-17 深圳市商汤科技有限公司 Neural network training method, neural network target detection method, neural network training device, neural network target detection device and storage medium
CN111967597A (en) * 2020-08-18 2020-11-20 上海商汤临港智能科技有限公司 Neural network training and image classification method, device, storage medium and equipment
CN111709497A (en) * 2020-08-20 2020-09-25 腾讯科技(深圳)有限公司 Information processing method and device and computer readable storage medium
CN112418268A (en) * 2020-10-22 2021-02-26 北京迈格威科技有限公司 Target detection method and device and electronic equipment
CN112464760A (en) * 2020-11-16 2021-03-09 北京明略软件系统有限公司 Training method and device for target recognition model
KR102232138B1 (en) * 2020-11-17 2021-03-25 (주)에이아이매틱스 Neural architecture search method based on knowledge distillation
CN112257815A (en) * 2020-12-03 2021-01-22 北京沃东天骏信息技术有限公司 Model generation method, target detection method, device, electronic device, and medium
CN112766244A (en) * 2021-04-07 2021-05-07 腾讯科技(深圳)有限公司 Target object detection method and device, computer equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Modeling teacher-student techniques in deep neural networks for knowledge distillation; Abbasi S et al; IEEE; 2020-02-20; 1-6 *
基于知识蒸馏的轻量型浮游植物检测网络 (Lightweight phytoplankton detection network based on knowledge distillation); 张彤彤; 董军宇; 赵浩然; 李琼; 孙鑫; 应用科学学报 (Journal of Applied Sciences); 2020-05-30 (03); 33-42 *

Also Published As

Publication number Publication date
CN113762051A (en) 2021-12-07

Similar Documents

Publication Publication Date Title
US10803352B2 (en) Image processing apparatus, image processing method, and image processing program
CN109960742B (en) Local information searching method and device
CN112132197B (en) Model training, image processing method, device, computer equipment and storage medium
CN111709497B (en) Information processing method and device and computer readable storage medium
WO2023185785A1 (en) Image processing method, model training method, and related apparatuses
CN111242844B (en) Image processing method, device, server and storage medium
CN108734120A (en) Mark method, apparatus, equipment and the computer readable storage medium of image
CN112232258B (en) Information processing method, device and computer readable storage medium
CN113762051B (en) Model training method, image detection device, storage medium and equipment
CN109583509B (en) Data generation method and device and electronic equipment
CN111754541A (en) Target tracking method, device, equipment and readable storage medium
CN106778852A (en) A kind of picture material recognition methods for correcting erroneous judgement
CN111292262B (en) Image processing method, device, electronic equipment and storage medium
CN111652974A (en) Method, device and equipment for constructing three-dimensional face model and storage medium
CN112990175B (en) Method, device, computer equipment and storage medium for recognizing handwritten Chinese characters
CN112508989B (en) Image processing method, device, server and medium
CN111177811A (en) Automatic fire point location layout method applied to cloud platform
CN109858402B (en) Image detection method, device, terminal and storage medium
CN112883827B (en) Method and device for identifying specified target in image, electronic equipment and storage medium
CN111160265B (en) File conversion method and device, storage medium and electronic equipment
CN114648762A (en) Semantic segmentation method and device, electronic equipment and computer-readable storage medium
CN113569809A (en) Image processing method, device and computer readable storage medium
CN112580750A (en) Image recognition method and device, electronic equipment and storage medium
CN112700481B (en) Texture map automatic generation method and device based on deep learning, computer equipment and storage medium
CN117974824A (en) Image generation method, device, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant