CN117036855A - Object detection model training method, device, computer equipment and storage medium - Google Patents

Object detection model training method, device, computer equipment and storage medium

Info

Publication number
CN117036855A
Authority
CN
China
Prior art keywords
target detection
model
sample image
scene
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310972379.1A
Other languages
Chinese (zh)
Inventor
赵紫州
秦诗玮
黄迎
李翰良
褚英昊
吕君钰
孔祥义
王兴照
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Weiai Intelligent Technology Co ltd
Original Assignee
Shenzhen Weiai Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Weiai Intelligent Technology Co ltd filed Critical Shenzhen Weiai Intelligent Technology Co ltd
Priority to CN202310972379.1A priority Critical patent/CN117036855A/en
Publication of CN117036855A publication Critical patent/CN117036855A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/0895 Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/096 Transfer learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a target detection model training method, apparatus, computer device, storage medium, and computer program product. The method comprises the following steps: acquiring a first target detection model obtained from a first sample image in a first scene, the first sample image being a sample image carrying a label; acquiring a second initial sample image in a second scene, the second initial sample image being a sample image carrying no label; performing label prediction processing on the second initial sample image based on the first target detection model to obtain a second target sample image carrying a prediction label; obtaining a target detection teacher model corresponding to the second scene according to the second target sample image carrying the prediction label and the first target detection model; and performing knowledge distillation on a target detection student model corresponding to the second scene based on the target detection teacher model to obtain a second target detection model for the second scene. By adopting the method, the generalization capability of the target detection model can be improved.

Description

Object detection model training method, device, computer equipment and storage medium
Technical Field
The present application relates to the field of object detection technology, and in particular, to a method, an apparatus, a computer device, a storage medium, and a computer program product for training an object detection model.
Background
Object detection is one of the core problems in computer vision and has been widely applied in scenes such as industrial production, transportation, and medical care.
In the related art, a target detection model for a given scene is usually trained on sample images from that scene. When the external environment changes significantly, for example when the model is deployed in a different scene, the mismatch between the training scene and the deployment scene degrades the model's performance. The generalization capability of target detection models trained in this way is therefore weak.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a target detection model training method, apparatus, computer device, computer-readable storage medium, and computer program product capable of improving the generalization capability of a target detection model, so as to solve the above technical problem of poor generalization.
In a first aspect, the present application provides a method for training a target detection model. The method comprises the following steps:
acquiring a first target detection model obtained from a first sample image in a first scene; the first sample image is a sample image carrying a label;
acquiring a second initial sample image in a second scene; the second scene is a scene different from the first scene; the second initial sample image is a sample image carrying no label;
performing label prediction processing on the second initial sample image based on the first target detection model to obtain a second target sample image carrying a prediction label;
obtaining a target detection teacher model corresponding to the second scene according to the second target sample image carrying the prediction label and the first target detection model;
and performing knowledge distillation on the target detection student model corresponding to the second scene based on the target detection teacher model to obtain a second target detection model for the second scene.
In one embodiment, performing label prediction processing on the second initial sample image based on the first target detection model to obtain a second target sample image carrying a prediction label includes:
determining a prediction label for each second initial sample image based on the first target detection model;
and identifying, from the second initial sample images, those whose prediction-label confidence is greater than or equal to a preset confidence threshold as second target sample images carrying prediction labels.
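The confidence-threshold filtering described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the `detect` callback (which stands in for the first target detection model's inference call), and the 0.5 threshold are all assumptions.

```python
def filter_pseudo_labels(images, detect, conf_threshold=0.5):
    """Keep images whose best pseudo-label meets the confidence threshold."""
    accepted, rejected = [], []
    for img in images:
        # detect(img) -> list of (label, confidence) candidate pseudo-labels
        label, conf = max(detect(img), key=lambda lc: lc[1])
        if conf >= conf_threshold:
            accepted.append((img, label))  # second target sample image + prediction label
        else:
            rejected.append(img)           # stays unlabeled for a later round
    return accepted, rejected

# toy stand-in detector with fixed candidate labels and confidences
fake_detect = lambda img: (
    [("defect", 0.8), ("no_defect", 0.4)] if img == "a"
    else [("defect", 0.3), ("no_defect", 0.2)]
)
accepted, rejected = filter_pseudo_labels(["a", "b"], fake_detect)
```

The rejected images are not discarded: as the next embodiment describes, they are pseudo-labeled again with the retrained model.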
In one embodiment, obtaining the target detection teacher model corresponding to the second scene according to the second target sample image carrying the prediction label and the first target detection model includes:
training the first target detection model on the second target sample images carrying prediction labels to obtain a trained first target detection model;
taking the second initial sample images whose prediction-label confidence is below the preset confidence threshold as the new second initial sample images, taking the trained first target detection model as the new first target detection model, and returning to the step of determining a prediction label for each second initial sample image based on the first target detection model, until the trained first target detection model meets a preset training condition;
and determining the trained first target detection model that meets the preset training condition as the target detection teacher model corresponding to the second scene.
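The iterative self-training loop above can be sketched as below. This is an assumed skeleton, not the patent's code: `predict` and `train` stand in for model inference and retraining, the toy "model" is just a counter, and the round cap standing in for the preset training condition is an assumption.

```python
def self_train_teacher(unlabeled, predict, train, conf_threshold=0.5, max_rounds=3):
    """Iterative self-training sketch: pseudo-label, keep confident images,
    retrain, and retry the remainder with the retrained model."""
    model = 0  # toy model state; a real system would hold network weights
    for _ in range(max_rounds):  # round cap stands in for the preset training condition
        confident, remaining = [], []
        for img in unlabeled:
            label, conf = predict(model, img)
            (confident if conf >= conf_threshold else remaining).append((img, label))
        if not confident:
            break  # no image cleared the threshold; stop early
        model = train(model, confident)            # retrain on confident pseudo-labels
        unlabeled = [img for img, _ in remaining]  # low-confidence images go round again
    return model  # teacher model for the second scene

# toy stand-ins: each round of training raises confidence on the harder image
predict = lambda model, img: ("obj", 0.9) if img == "easy" else ("obj", 0.4 + 0.2 * model)
train = lambda model, batch: model + 1
teacher = self_train_teacher(["easy", "hard"], predict, train)
```

In this toy run the "easy" image is accepted in round one, and the "hard" image only clears the threshold after the model has been retrained once.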
In one embodiment, performing knowledge distillation on the target detection student model corresponding to the second scene based on the target detection teacher model to obtain the second target detection model includes:
performing target detection on the second target sample image with the target detection teacher model to obtain a teacher label of the second target sample image, and performing target detection on the same image with the target detection student model to obtain a student label of the second target sample image;
training the target detection student model based on first label-difference information between the teacher label and the student label of the second target sample image, and second label-difference information between the student label and the prediction label of the second target sample image, to obtain the second target detection model for the second scene.
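A minimal sketch of a loss combining the two label-difference terms above, under stated assumptions: the patent does not specify the loss functions, so the KL-divergence/cross-entropy choice, the temperature `T`, and the weight `alpha` are illustrative, not the patent's actual formulation.

```python
import math

def softmax(logits, T=1.0):
    """Numerically stable temperature-scaled softmax."""
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def response_distill_loss(teacher_logits, student_logits, pseudo_label, alpha=0.5, T=2.0):
    """First term: KL divergence between softened teacher and student outputs
    (first label-difference information). Second term: cross-entropy of the
    student against the prediction label (second label-difference information)."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s))
    ce = -math.log(softmax(student_logits)[pseudo_label])
    return alpha * kl + (1 - alpha) * ce
```

When the student exactly matches the teacher, the first term vanishes and only the prediction-label term remains.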
In one embodiment, performing knowledge distillation on the target detection student model corresponding to the second scene based on the target detection teacher model to obtain the second target detection model further includes:
determining teacher output features of the second target sample image at an intermediate network layer of the target detection teacher model, and determining student output features of the second target sample image at an intermediate network layer of the target detection student model;
and training the intermediate network layer of the target detection student model based on the feature-difference information between the teacher output features and the student output features, to obtain the second target detection model for the second scene.
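The feature-difference term above can be sketched as a mean squared error between intermediate-layer features. The MSE choice is an assumption (the patent does not name a metric), and the features are flattened to equal-length lists here; real detectors with differing teacher/student channel counts would typically add an adapter layer first.

```python
def feature_distill_loss(teacher_feats, student_feats):
    """Mean squared error between flattened intermediate-layer feature maps
    (feature-based knowledge). Assumes the two lists have equal length."""
    assert len(teacher_feats) == len(student_feats)
    return sum((t - s) ** 2 for t, s in zip(teacher_feats, student_feats)) / len(teacher_feats)
```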
In one embodiment, performing knowledge distillation on the target detection student model corresponding to the second scene based on the target detection teacher model to obtain the second target detection model further includes:
obtaining teacher labels for all second target sample images based on the target detection teacher model, and obtaining student labels for all second target sample images based on the target detection student model;
extracting features of the relations among the teacher labels to obtain teacher-label relation information, and extracting features of the relations among the student labels to obtain student-label relation information;
and training the target detection student model based on the relation-difference information between the teacher-label relation information and the student-label relation information, to obtain the second target detection model for the second scene.
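One common way to realize relation-based distillation of this kind is to compare pairwise relation matrices of the two models' outputs; the sketch below assumes that formulation (pairwise absolute differences, squared-error comparison), which the patent does not specify.

```python
def pairwise_relations(outputs):
    """Relation information: pairwise absolute differences between per-image outputs."""
    n = len(outputs)
    return [[abs(outputs[i] - outputs[j]) for j in range(n)] for i in range(n)]

def relation_distill_loss(teacher_outputs, student_outputs):
    """Mean squared difference between the teacher's and student's relation matrices."""
    rt = pairwise_relations(teacher_outputs)
    rs = pairwise_relations(student_outputs)
    n = len(teacher_outputs)
    return sum((rt[i][j] - rs[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)
```

Note the contrast with response-based distillation: a student whose outputs are uniformly shifted from the teacher's still has zero relation loss, because the relations among its outputs are preserved.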
In a second aspect, the application further provides a target detection model training device. The device comprises:
the first model training module, configured to acquire a first target detection model obtained from a first sample image in a first scene; the first sample image is a sample image carrying a label;
the sample image acquisition module, configured to acquire a second initial sample image in a second scene; the second scene is a scene different from the first scene; the second initial sample image is a sample image carrying no label;
the sample label prediction module, configured to perform label prediction processing on the second initial sample image based on the first target detection model to obtain a second target sample image carrying a prediction label;
the detection model migration module, configured to obtain a target detection teacher model corresponding to the second scene according to the second target sample image carrying the prediction label and the first target detection model;
and the second model training module, configured to perform knowledge distillation on the target detection student model corresponding to the second scene based on the target detection teacher model to obtain a second target detection model for the second scene.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor which, when executing the computer program, performs the steps of:
acquiring a first target detection model obtained according to a first sample image in a first scene; the first sample image is a sample image carrying a label;
acquiring a second initial sample image in a second scene; the second scene is a different scene than the first scene; the second initial sample image is a sample image without a label;
performing label prediction processing on the second initial sample image based on the first target detection model to obtain a second target sample image carrying a prediction label;
obtaining a target detection teacher model corresponding to the second scene according to the second target sample image carrying the prediction label and the first target detection model;
and performing knowledge distillation on the target detection student model corresponding to the second scene based on the target detection teacher model to obtain a second target detection model for the second scene.
In a fourth aspect, the present application also provides a computer-readable storage medium. The computer-readable storage medium has stored thereon a computer program which, when executed by a processor, performs the steps of:
acquiring a first target detection model obtained according to a first sample image in a first scene; the first sample image is a sample image carrying a label;
acquiring a second initial sample image in a second scene; the second scene is a different scene than the first scene; the second initial sample image is a sample image without a label;
performing label prediction processing on the second initial sample image based on the first target detection model to obtain a second target sample image carrying a prediction label;
obtaining a target detection teacher model corresponding to the second scene according to the second target sample image carrying the prediction label and the first target detection model;
and carrying out knowledge distillation on the target detection student model corresponding to the second scene based on the target detection teacher model to obtain a second target detection model in the second scene.
In a fifth aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of:
acquiring a first target detection model obtained according to a first sample image in a first scene; the first sample image is a sample image carrying a label;
acquiring a second initial sample image in a second scene; the second scene is a different scene than the first scene; the second initial sample image is a sample image without a label;
performing label prediction processing on the second initial sample image based on the first target detection model to obtain a second target sample image carrying a prediction label;
obtaining a target detection teacher model corresponding to the second scene according to the second target sample image carrying the prediction label and the first target detection model;
and carrying out knowledge distillation on the target detection student model corresponding to the second scene based on the target detection teacher model to obtain a second target detection model in the second scene.
With the target detection model training method, apparatus, computer device, storage medium, and computer program product, a first target detection model obtained from labeled first sample images in a first scene is first acquired; unlabeled second initial sample images are then acquired in a second scene different from the first scene; label prediction processing is performed on the second initial sample images based on the first target detection model to obtain second target sample images carrying prediction labels; a target detection teacher model corresponding to the second scene is obtained from the second target sample images carrying prediction labels and the first target detection model; and finally, knowledge distillation is performed on a target detection student model corresponding to the second scene based on the teacher model to obtain a second target detection model for the second scene. In this way, the prediction labels produced by the first-scene model annotate the second-scene images, yielding the teacher model for the second scene. Distilling that teacher model into the student model then yields a lighter-weight second target detection model while preserving detection performance, so the target detection model is migrated across scenes and its generalization capability in different scenes is improved.
Drawings
FIG. 1 is a flow chart of a method for training a target detection model in one embodiment;
FIG. 2 is a flowchart illustrating steps for obtaining a target detection teacher model corresponding to a second scenario in an embodiment;
FIG. 3 is a flowchart illustrating steps for obtaining a second object detection model in a second scenario in one embodiment;
FIG. 4 is a flowchart illustrating a step of obtaining a second object detection model in a second scenario according to another embodiment;
FIG. 5 is a flowchart illustrating a step of obtaining a second object detection model in a second scenario according to another embodiment;
FIG. 6 is a flow chart of a method for training a target detection model according to another embodiment;
FIG. 7 is a flow diagram of a method of constructing a semi-supervised knowledge distillation framework for universal robotic vision system optimization, in one embodiment;
FIG. 8 is a block diagram of a target detection model training apparatus in one embodiment;
fig. 9 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region.
In an exemplary embodiment, as shown in fig. 1, a target detection model training method is provided; in this embodiment, the method is described as applied to a server. It will be appreciated that the method may also be applied to a terminal, or to a system comprising a server and a terminal and implemented through interaction between the two. The server may be an independent server or a server cluster composed of a plurality of servers; the terminal may be, but is not limited to, a personal computer, notebook computer, smart phone, or tablet computer. In this embodiment, the method includes the following steps:
step S102, a first target detection model obtained according to a first sample image in a first scene is obtained.
The first sample image is a sample image carrying a label; the label of the first sample image is obtained by manual labeling.
The first target detection model is trained with a commonly used target detection algorithm, such as an algorithm of the R-CNN family (Region-based Convolutional Neural Networks), the YOLO family (You Only Look Once), or SSD (Single Shot MultiBox Detector).
Specifically, a server first collects a plurality of sample images in the first scene and has them manually labeled, obtaining a plurality of first sample images in the first scene; the server then uses these first sample images as a training set and trains a model on it with a commonly used target detection algorithm, such as a YOLO-family algorithm, to obtain the first target detection model for the first scene.
Step S104, a second initial sample image in a second scene is acquired.
Wherein the second scene is a different scene than the first scene; for example, the first scene is a security monitoring scene, and the second scene is an industrial quality inspection scene; for another example, the first scene is an industrial quality inspection scene that detects a product defect on a production line, and the second scene is an industrial quality inspection scene that detects a part assembly position in the production process.
Wherein the second initial sample image is a sample image that does not carry a label.
Specifically, when the server detects that the first target detection model in the first scene needs to be applied to the second scene, the server firstly acquires a plurality of sample images in the second scene as a plurality of second initial sample images in the second scene. It can be appreciated that the acquisition process of the sample image does not involve a labeling process of the sample image, and thus the second initial sample image is a sample image that does not carry a label.
For example, assume the first scene is a security monitoring scene, i.e., the first target detection model detects whether a person entering a certain area has access rights, and the second scene is an industrial quality inspection scene, i.e., the second target detection model required by that scene detects product defects, assembly errors, and the like on a production line. When the server detects that the first target detection model from the security monitoring scene needs to be applied to the industrial quality inspection scene, it first collects a plurality of sample images from the industrial quality inspection scene as the second initial sample images.
And S106, performing label prediction processing on the second initial sample image based on the first target detection model to obtain a second target sample image carrying a prediction label.
Specifically, the server inputs the second initial sample images into the first target detection model, which outputs, for each second initial sample image, a plurality of pseudo-labels together with their confidences; for each second initial sample image, the server then takes the pseudo-label with the highest confidence as the prediction label of that image, thereby obtaining second target sample images carrying prediction labels.
It can be understood that the core task of target detection is to determine the position and category of an object, and a pseudo-label corresponds to a recognition result (i.e., a category) of the target detection model. Taking as the first scene an industrial quality inspection scene that detects product defects on a production line, the core task there is to detect whether a product is defective, so the recognition results of the first target detection model include a "true" label for the presence of a defect and a "false" label for its absence (true and false here indicate a logical yes or no). Taking as the second scene an industrial quality inspection scene that detects part assembly positions during production, the core task there is to detect whether each position to be assembled is the assembly position of a specific part, so for a sample image of each such position, the pseudo-labels include a "true" label and a "false" label for that determination.
For example, suppose that among the pseudo-labels of a sample image of a certain position to be assembled, the "true" label has a confidence of 0.8 and the "false" label has a confidence of 0.4; the "true" label is then taken as the prediction label of that sample image, and on this basis the position can be preliminarily determined to be the assembly position of the specific part.
And step S108, obtaining a target detection teacher model corresponding to the second scene according to the second target sample image carrying the prediction label and the first target detection model.
Specifically, the server adds the second target sample images carrying prediction labels to the training set of the first target detection model and trains the first target detection model on the updated training set, obtaining the target detection teacher model corresponding to the second scene.
It can be understood that, because the updated training set contains both the first sample images and the second target sample images, the resulting teacher model can perform target detection in both the first and second scenes; its model volume is therefore larger and its structure more complex, i.e., the teacher model is more redundant, so it runs more slowly.
And step S110, based on the target detection teacher model, performing knowledge distillation on the target detection student model corresponding to the second scene to obtain a second target detection model in the second scene.
In the field of neural networks, knowledge refers to the learned weights and biases. Knowledge distillation is a model compression method: it migrates the knowledge in a teacher model, without changing that knowledge, into a student model with a simpler structure, thereby compressing the model.
The target detection student model is a model whose structure is simpler than that of the target detection teacher model.
Specifically, the server migrates the knowledge in the target detection teacher model into the structurally simpler target detection student model through knowledge distillation, thereby compressing the teacher model and obtaining the second target detection model for the second scene. After obtaining the second target detection model, the server may deploy it in the second scene to perform target detection there. If the server then detects that the second target detection model needs to be migrated to a third scene different from both the first and the second scene, it takes the second target detection model as the first target detection model and the third scene as the second scene, and returns to step S104 of acquiring a second initial sample image in the second scene, thereby obtaining a target detection model for the new scene.
Wherein the knowledge can be classified into response-based knowledge, feature-based knowledge, and relationship-based knowledge; the knowledge based on the response is used for representing the influence of the output layer of the model on the output result of the model; the knowledge based on the characteristics is used for representing the influence of an intermediate network layer of the model on an output result of the model; the relationship-based knowledge is used to characterize the effect of relationships between the output results of the model on the output results of the model.
For example, the server may perform different types of knowledge distillation on the target detection student model based on the different types of knowledge: the three types of distillation may be applied sequentially in some order, or applied simultaneously.
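Applying the three types of distillation simultaneously can be sketched as a weighted sum of the three loss terms. The patent does not give a combination rule, so the weighted-sum form and the equal default weights are assumptions for illustration.

```python
def combined_distillation_loss(losses, weights=(1.0, 1.0, 1.0)):
    """losses = (response_term, feature_term, relation_term); summing the weighted
    terms applies the three knowledge types simultaneously in one training step."""
    return sum(w * l for w, l in zip(weights, losses))
```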
In the target detection model training method provided in the above embodiment, a server first acquires a first target detection model obtained according to a first sample image in a first scene; the first sample image is a sample image carrying a label; then acquiring a second initial sample image in a second scene; the second scene is a different scene than the first scene; the second initial sample image is a sample image without a tag; then, based on the first target detection model, carrying out label prediction processing on the second initial sample image to obtain a second target sample image carrying a prediction label; then, according to a second target sample image carrying the prediction tag and the first target detection model, a target detection teacher model corresponding to the second scene is obtained; and finally, based on the target detection teacher model, performing knowledge distillation on the target detection student model corresponding to the second scene to obtain a second target detection model in the second scene. In this way, the server obtains the prediction label of the second initial sample image in the second scene through the first target detection model in the first scene, label marking of the second initial sample image can be achieved, then the target detection teacher model corresponding to the second scene is obtained, then the server carries out knowledge distillation on the target detection student model corresponding to the second scene through the target detection teacher model corresponding to the second scene, and under the condition that the target detection effect is ensured, a second target detection model with lighter weight can be obtained, so that migration of the target detection model in different scenes is achieved, and generalization capability of the target detection model in different scenes is improved.
In an exemplary embodiment, the step S106 performs label prediction processing on the second initial sample image based on the first target detection model to obtain a second target sample image carrying a predicted label, which specifically includes the following steps: determining a predictive label of each second initial sample image based on the first target detection model; and identifying the second initial sample image with the confidence coefficient of the corresponding predictive label being greater than or equal to a preset confidence coefficient threshold value from the second initial sample images as a second target sample image carrying the predictive label.
Specifically, the server inputs a plurality of second initial sample images into a first target detection model, and outputs a plurality of pseudo tags of each second initial sample image and confidence degrees corresponding to the pseudo tags through the first target detection model; then, for each second initial sample image, the server determines a corresponding pseudo tag with highest confidence coefficient from a plurality of pseudo tags of the second initial sample image as a prediction tag of the second initial sample image; and then, the server distinguishes the second initial sample image into a second initial sample image with the confidence coefficient of the corresponding prediction label being larger than or equal to a preset confidence coefficient threshold value and a second initial sample image with the confidence coefficient of the corresponding prediction label being smaller than the preset confidence coefficient threshold value according to the confidence coefficient of the prediction label corresponding to each second initial sample image, and takes the second initial sample image with the confidence coefficient of the corresponding prediction label being larger than or equal to the preset confidence coefficient threshold value as a second target sample image carrying the prediction label, and discards the prediction label of the second initial sample image with the confidence coefficient of the corresponding prediction label being smaller than the preset confidence coefficient threshold value.
For example, assume that the preset confidence threshold is 0.75, the confidence of the prediction label of a first second initial sample image is 0.85, and the confidence of the prediction label of a second such image is 0.55. The server then determines the first image as a second target sample image carrying its prediction label, and discards the prediction label of the second image, i.e., that image no longer has a corresponding prediction label.
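A minimal sketch of this confidence-threshold filtering in Python (the tuple format and the `filter_pseudo_labels` helper are illustrative assumptions, not part of the original method):

```python
def filter_pseudo_labels(predictions, conf_threshold=0.75):
    """Split predictions into accepted pseudo-labeled samples and
    rejected samples whose predicted label is discarded.

    predictions: list of (image_id, label, confidence) tuples, where
    `label` is the highest-confidence pseudo label for that image.
    """
    accepted, rejected = [], []
    for image_id, label, confidence in predictions:
        if confidence >= conf_threshold:
            accepted.append((image_id, label))   # second target sample image
        else:
            rejected.append(image_id)            # prediction label discarded
    return accepted, rejected

# Mirrors the example above: 0.85 passes the 0.75 threshold, 0.55 does not.
preds = [("img1", "defect", 0.85), ("img2", "defect", 0.55)]
kept, dropped = filter_pseudo_labels(preds)
```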
In addition, the prediction label of each second initial sample image is determined so as to satisfy the label prediction loss function shown in Equation 1:

$$L_p = \frac{1}{n}\sum_{n} L(y_n, f_n) + \alpha(t)\,\frac{1}{m}\sum_{m} L(y_m, f_m) \qquad \text{(Equation 1)}$$

wherein L_p denotes the label prediction loss function and L(·) denotes a supervised learning loss function; n denotes the number of sample images carrying a label (i.e., the total number of first sample images and of second target sample images carrying a prediction label), y_n denotes the label carried by the n-th labeled sample image, and f_n denotes the label output by the first target detection model for the n-th labeled sample image; m denotes the number of sample images not carrying a label (i.e., the number of second initial sample images without a prediction label), y_m denotes the prediction label corresponding to the m-th unlabeled sample image, and f_m denotes the label output by the first target detection model for the m-th unlabeled sample image; α(t) denotes an adjustment term, obtained by Equation 2:

$$\alpha(t) = \begin{cases} 0, & t < T_1 \\[4pt] \dfrac{t - T_1}{T_2 - T_1}\,\alpha_f, & T_1 \le t < T_2 \\[4pt] \alpha_f, & T_2 \le t \end{cases} \qquad \text{(Equation 2)}$$

wherein t denotes the number of iterations of determining the prediction label of each second initial sample image; T_1 and T_2 are both constants: T_1 denotes the iteration count at which α(t) starts to increase, and T_2 denotes the iteration count at which α(t) reaches α_f; α_f denotes the final value of α(t).
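The ramp-up behaviour of α(t) described above (zero until T_1, increasing until it reaches α_f at T_2) can be sketched as follows; the concrete values of T_1, T_2, and α_f are illustrative, not values given in the patent:

```python
def alpha(t, t1=100, t2=600, alpha_f=3.0):
    """Adjustment term alpha(t): 0 before iteration t1, then a linear
    ramp that reaches its final value alpha_f at iteration t2."""
    if t < t1:
        return 0.0
    if t < t2:
        return alpha_f * (t - t1) / (t2 - t1)
    return alpha_f

# Early iterations rely only on labeled data; pseudo-labels phase in later.
ramp = (alpha(0), alpha(350), alpha(1000))  # (0.0, 1.5, 3.0)
```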
In this embodiment, by means of the preset confidence threshold, the server screens out, from the plurality of second initial sample images, the second target sample images whose prediction labels are reliable, so as to avoid the influence of unreliable prediction labels on target detection in the second scene, thereby ensuring the accuracy of the second target detection model in the second scene and further improving the generalization capability of the target detection model across different scenes.
As shown in fig. 2, in an exemplary embodiment, the step S108, according to the second target sample image carrying the prediction tag and the first target detection model, obtains a target detection teacher model corresponding to the second scene, specifically includes the following steps:
Step S202, training the first target detection model based on a second target sample image carrying a prediction label to obtain a trained first target detection model.
Step S204, taking the second initial sample image with the confidence coefficient of the corresponding predictive label smaller than the preset confidence coefficient threshold value in each second initial sample image as the second initial sample image, taking the trained first target detection model as the first target detection model, and returning to the step of determining the predictive label of each second initial sample image based on the first target detection model until the trained first target detection model meets the preset training condition.
Step S206, determining the trained first target detection model meeting the preset training conditions as a target detection teacher model corresponding to the second scene.
The preset training conditions may be that the training number of the first target detection model after training reaches the preset training number, or that the number of the second target sample images carrying the prediction tag reaches the preset sample number.
Specifically, the server adds the second target sample images carrying prediction labels to the training set of the first target detection model and trains the first target detection model on the updated training set. Then, using the trained first target detection model, it again predicts labels for the second initial sample images whose previous prediction labels had confidence below the preset confidence threshold; that is, those images' discarded prediction labels are dropped, the images are treated as the second initial sample images, and the process returns to the step of determining the prediction label of each second initial sample image based on the first target detection model and then training the model on the second target sample images carrying prediction labels, until the number of training rounds of the trained first target detection model reaches the preset number of training rounds or the number of second target sample images carrying prediction labels reaches the preset number of samples, thereby obtaining a trained first target detection model that satisfies the preset training condition. Finally, the server determines this trained first target detection model as the target detection teacher model corresponding to the second scene.
In this embodiment, the server can update the training set of the first target detection model by using the second target sample image carrying the predictive tag, so that the first target detection model can perform target detection on the second scene, and based on the trained first target detection model, the predictive tag can also be generated for the second initial sample image not carrying the tag, that is, the sample image in the second scene is labeled, thereby continuously improving the target detection capability of the first target detection model in the second scene, and improving the generalization capability of the target detection model in different scenes.
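The loop of steps S202–S206 can be sketched as follows; `predict` and `train` are hypothetical callables standing in for the first target detection model's inference and training routines:

```python
def self_training(model, labeled, unlabeled, predict, train,
                  conf_threshold=0.75, max_rounds=10):
    """Iteratively pseudo-label the unlabeled pool with the current model,
    fold confident samples into the training set, and retrain, until no
    new samples are accepted or the round budget is exhausted."""
    for _ in range(max_rounds):
        remaining = []
        for image in unlabeled:
            label, conf = predict(model, image)
            if conf >= conf_threshold:
                labeled.append((image, label))   # second target sample image
            else:
                remaining.append(image)          # prediction label discarded
        progressed = len(remaining) < len(unlabeled)
        model = train(model, labeled)            # trained first target model
        unlabeled = remaining
        if not progressed:
            break
    return model, labeled, unlabeled

# Toy run: images below 5 are predicted confidently, others are not.
predict = lambda m, img: ("obj", 0.9 if img < 5 else 0.5)
train = lambda m, data: m + 1                    # stand-in: count rounds
model, labeled, unlabeled = self_training(0, [], [1, 2, 6], predict, train)
```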
In an exemplary embodiment, the knowledge distillation includes response-based knowledge distillation.
As shown in fig. 3, the step S110 is performed to perform knowledge distillation on the target detection student model corresponding to the second scene based on the target detection teacher model, to obtain a second target detection model in the second scene, and specifically includes the following steps:
step S302, performing target detection on the second target sample image based on the target detection teacher model to obtain a teacher label of the second target sample image, and performing target detection on the second target sample image based on the target detection student model to obtain a student label of the second target sample image.
Step S304, training the target detection student model based on first label difference information between the teacher label and the student label of the second target sample image and second label difference information between the student label and the prediction label of the second target sample image to obtain a second target detection model in a second scene.
Specifically, the server performs response-based knowledge distillation on the object detection student model as follows: the server inputs each second target sample image into a target detection teacher model to obtain labels output by each second target sample image under the target detection teacher model as teacher labels of the second target sample images, and inputs each second target sample image into a target detection student model to obtain labels output by each second target sample image under the target detection student model as student labels of the second target sample images; then, the server determines first label difference information between a teacher label and a student label of each second target sample image and second label difference information between the student label of the second target sample image and a prediction label carried by the student label of the second target sample image; and then, the server calculates a response knowledge distillation loss value between the target detection teacher model and the target detection student model according to the response-based knowledge distillation loss function and the first label difference information and the second label difference information corresponding to each second target sample image, trains the target detection student model until the corresponding response knowledge distillation loss value is smaller than a preset response knowledge distillation loss threshold value under the condition that the response knowledge distillation loss value is larger than or equal to the preset response knowledge distillation loss threshold value, and obtains the trained target detection student model, and takes the trained target detection student model as a second target detection model in a second scene.
Wherein the response-based knowledge distillation loss function is shown in equation 3:
$$L_{Response} = \alpha L_{soft} + \beta L_{hard} \qquad \text{(Equation 3)}$$

wherein L_Response denotes the response-based knowledge distillation loss function; L_soft denotes the first label difference information between the teacher label and the student label of a second target sample image; L_hard denotes the second label difference information between the student label and the prediction label of a second target sample image; α and β denote the weights of the first label difference information and the second label difference information, respectively.

The first label difference information L_soft is calculated by Equation 4:

$$L_{soft} = -\sum_{j=1}^{J} p_j^{T} \log q_j^{T} \qquad \text{(Equation 4)}$$

wherein J denotes the number of samples in the training set of the target detection teacher model and the target detection student model, i.e., the total number of second target sample images carrying prediction labels; $p_j^{T}$ denotes the teacher label output by the target detection teacher model for the j-th sample at temperature T; $q_j^{T}$ denotes the student label output by the target detection student model for the j-th sample at temperature T. $p_j^{T}$ and $q_j^{T}$ are calculated by Equation 5:

$$p_j^{T} = \frac{\exp(v_j/T)}{\sum_{k}\exp(v_k/T)}, \qquad q_j^{T} = \frac{\exp(u_j/T)}{\sum_{k}\exp(u_k/T)} \qquad \text{(Equation 5)}$$

wherein v_j and u_j denote the corresponding log probabilities (logits) of the target detection teacher model and the target detection student model, respectively.
The second label difference information L_hard is calculated by Equation 6:

$$L_{hard} = -\sum_{j=1}^{J} c_j \log q_j^{1} \qquad \text{(Equation 6)}$$

wherein c_j denotes the prediction label of the j-th sample; $q_j^{1}$ denotes the student label output by the target detection student model for the j-th sample at temperature T = 1, calculated by Equation 7:

$$q_j^{1} = \frac{\exp(u_j)}{\sum_{k}\exp(u_k)} \qquad \text{(Equation 7)}$$
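A compact Python sketch of this response-based loss for a single sample, following the standard temperature-softened distillation the equations describe; the temperature and the α/β weights below are illustrative values, not values from the patent:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def response_kd_loss(teacher_logits, student_logits, label_index,
                     temperature=4.0, alpha=0.7, beta=0.3):
    """alpha * L_soft + beta * L_hard for a single sample.

    L_soft is the cross-entropy between teacher and student soft labels at
    temperature T; L_hard is the cross-entropy between the student output
    at T = 1 and the (pseudo) prediction label `label_index`.
    """
    p = softmax(teacher_logits, temperature)   # teacher soft targets
    q = softmax(student_logits, temperature)   # student at temperature T
    l_soft = -sum(pi * math.log(qi) for pi, qi in zip(p, q))
    q1 = softmax(student_logits)               # student at T = 1
    l_hard = -math.log(q1[label_index])
    return alpha * l_soft + beta * l_hard

# A student that matches the teacher incurs a lower loss than one that
# disagrees with it.
l_match = response_kd_loss([2.0, 0.0, 0.0], [2.0, 0.0, 0.0], 0)
l_mismatch = response_kd_loss([2.0, 0.0, 0.0], [0.0, 2.0, 0.0], 0)
```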
in this embodiment, the server performs response-based knowledge distillation on the target detection student model, and can be based on the output layer of the model, so that the output result of the target detection student model is similar to the output result of the target detection teacher model as much as possible, and further, the target detection student model is similar to the target detection teacher model as much as possible, thereby implementing model compression on the target detection teacher model, and improving the target detection speed in the second scenario.
In an exemplary embodiment, the knowledge distillation further includes feature-based knowledge distillation.
As shown in fig. 4, step S110 is performed to distill knowledge of the target detection student model corresponding to the second scene based on the target detection teacher model to obtain a second target detection model in the second scene, and specifically further includes the following steps:
in step S402, it is determined that the second target sample image corresponds to the teacher output feature of the intermediate network layer of the target detection teacher model, and it is determined that the second target sample image corresponds to the student output feature of the intermediate network layer of the target detection student model.
Step S404, training an intermediate network layer of the target detection student model based on the characteristic difference information between the teacher output characteristic and the student output characteristic to obtain a second target detection model in a second scene.
The teacher output characteristic is an output characteristic diagram obtained by a second target sample image based on an intermediate network layer of the target detection teacher model; the student output characteristic is an output characteristic diagram obtained by the second target sample image based on the intermediate network layer of the target detection student model.
Specifically, the process by which the server performs feature-based knowledge distillation on the target detection student model is as follows: the server first inputs each second target sample image into the target detection teacher model and determines the output feature map produced by the intermediate network layer of the target detection teacher model as the teacher output feature; it likewise inputs each second target sample image into the target detection student model and determines the output feature map produced by the intermediate network layer of the target detection student model as the student output feature. Because the structures of the target detection teacher model and the target detection student model differ, the server also needs to map the teacher output feature and the student output feature to the same dimension, and takes the distance between the mapped teacher output feature and the mapped student output feature as the feature difference information between them. Then, the server calculates a feature knowledge distillation loss value between the target detection teacher model and the target detection student model based on the feature-based knowledge distillation loss function and the feature difference information; if this loss value is greater than or equal to a preset feature knowledge distillation loss threshold, the server trains the target detection student model until the loss value falls below that threshold, obtains the trained target detection student model, and takes it as the second target detection model in the second scene.
Wherein the feature-based knowledge distillation loss function is shown in Equation 8:

$$L_{Feature} = \sum_{j=1}^{J} d\left(T'_{feature\text{-}j},\; S'_{feature\text{-}j}\right) \qquad \text{(Equation 8)}$$

wherein L_Feature denotes the feature-based knowledge distillation loss function; $T'_{feature\text{-}j}$ denotes the mapped teacher output feature of the j-th sample; $S'_{feature\text{-}j}$ denotes the mapped student output feature of the j-th sample; d(·,·) denotes the feature difference information between the teacher output feature and the student output feature, i.e., the distance between the two mapped features.
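A toy sketch of this feature-matching step in Python: the dimension-alignment mapping is stood in for by simple average pooling, and the distance is a squared Euclidean distance (a real system would typically use a learned projection layer; all names and choices here are illustrative):

```python
def project(features, dim):
    """Stand-in for the dimension-alignment mapping: average-pool a flat
    feature vector down to `dim` values."""
    step = len(features) / dim
    pooled = []
    for i in range(dim):
        lo, hi = int(i * step), int((i + 1) * step)
        pooled.append(sum(features[lo:hi]) / (hi - lo))
    return pooled

def feature_kd_loss(teacher_feats, student_feats, dim=4):
    """Summed squared distance between the mapped teacher and student
    features, one term per sample."""
    loss = 0.0
    for t, s in zip(teacher_feats, student_feats):
        t_p, s_p = project(t, dim), project(s, dim)
        loss += sum((a - b) ** 2 for a, b in zip(t_p, s_p))
    return loss

# Identical features give zero loss; divergent features do not.
zero = feature_kd_loss([[1.0] * 8], [[1.0] * 8])
positive = feature_kd_loss([[1.0] * 8], [[0.0] * 8])
```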
In this embodiment, the server performs feature-based knowledge distillation on the object detection student model, and can be based on the middle layer of the model, so that the network structure of the object detection student model is similar to that of the object detection teacher model as much as possible, and further, the object detection student model is similar to the object detection teacher model as much as possible, thereby realizing model compression on the object detection teacher model, and improving the object detection speed in the second scene.
In an exemplary embodiment, the knowledge distillation further includes relationship-based knowledge distillation.
As shown in fig. 5, step S110 is performed to distill knowledge of the target detection student model corresponding to the second scene based on the target detection teacher model to obtain a second target detection model in the second scene, and specifically further includes the following steps:
In step S502, a teacher label of each second target sample image is obtained based on the target detection teacher model, and a student label of each second target sample image is obtained based on the target detection student model.
Step S504, extracting features of the relation among the teacher labels to obtain teacher label relation information, and extracting features of the relation among the student labels to obtain student label relation information.
Step S506, training the target detection student model based on the relation difference information between the teacher label relation information and the student label relation information to obtain a second target detection model in a second scene.
Specifically, the process of the server for performing relationship-based knowledge distillation on the object detection student model is as follows: the server inputs each second target sample image into a target detection teacher model to obtain labels output by each second target sample image under the target detection teacher model as teacher labels of the second target sample images, and inputs each second target sample image into a target detection student model to obtain labels output by each second target sample image under the target detection student model as student labels of the second target sample images; then, the server performs feature extraction on the relation among the teacher labels corresponding to the plurality of second target sample images through a relation extraction function to obtain teacher label relation information, and performs feature extraction on the relation among the student labels corresponding to the plurality of second target sample images to obtain student label relation information, so as to determine relation difference information representing similarity between the teacher label relation information and the student label relation; and then, the server calculates a relation knowledge distillation loss value between the target detection teacher model and the target detection student model based on the relation knowledge distillation loss function and relation difference information, trains the target detection student model under the condition that the relation knowledge distillation loss value is larger than or equal to a preset relation knowledge distillation loss threshold value until the corresponding relation knowledge distillation loss value is smaller than the preset relation knowledge distillation loss threshold value, obtains the trained target detection student model, and takes the trained target detection student model as a second target detection model in a second scene.
Wherein the relationship-based knowledge distillation loss function is shown in Equation 9:

$$L_{Relation} = \sum_{k,\,l} l_1\left(\varphi(t_k, t_l),\; \varphi(s_k, s_l)\right) \qquad \text{(Equation 9)}$$

wherein L_Relation denotes the relationship-based knowledge distillation loss function; t_k and t_l denote the teacher labels of the k-th and l-th samples, respectively; s_k and s_l denote the student labels of the k-th and l-th samples, respectively; l_1(·) is the minimized absolute-error (L1) loss function; φ(·) denotes the normalized pairwise distance, with distance normalization factor μ, calculated by Equation 10:

$$\varphi(t_k, t_l) = \frac{1}{\mu}\,\lVert t_k - t_l \rVert_2 \qquad \text{(Equation 10)}$$

wherein μ denotes the mean distance between sample pairs in the batch, which normalizes the pairwise distances.
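A sketch of this distance-wise relational loss in Python, comparing mean-normalized pairwise distances between teacher outputs and between student outputs with an L1 penalty; treating each label as a plain vector is an illustrative simplification:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def relation_kd_loss(teacher_labels, student_labels):
    """L1 penalty between the mean-normalized pairwise distances of
    teacher outputs and of student outputs."""
    n = len(teacher_labels)
    pairs = [(k, l) for k in range(n) for l in range(k + 1, n)]
    mu_t = sum(euclidean(teacher_labels[k], teacher_labels[l])
               for k, l in pairs) / len(pairs)
    mu_s = sum(euclidean(student_labels[k], student_labels[l])
               for k, l in pairs) / len(pairs)
    loss = 0.0
    for k, l in pairs:
        phi_t = euclidean(teacher_labels[k], teacher_labels[l]) / mu_t
        phi_s = euclidean(student_labels[k], student_labels[l]) / mu_s
        loss += abs(phi_t - phi_s)
    return loss

# Normalization makes the loss insensitive to a uniform scale change.
t = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
scaled = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
```

Because only relative structure is compared, a student whose outputs are a uniformly scaled copy of the teacher's incurs (near) zero relational loss.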
in this embodiment, the server performs knowledge distillation on the target detection student model based on the relationship between the output results of the models, so that the network structure of the target detection student model is similar to the network structure of the target detection teacher model as much as possible, and further the target detection student model is similar to the target detection teacher model as much as possible, thereby implementing model compression on the target detection teacher model, and improving the target detection speed in the second scenario.
In an exemplary embodiment, the server may further perform response-based knowledge distillation, feature-based knowledge distillation, and relationship-based knowledge distillation on the object detection student model at the same time, and obtain a second object detection model in the second scenario under the comprehensive loss function as shown in formula 11;
$$L_D = L_{GT} + \lambda_1 L_{Response} + \lambda_2 L_{Feature} + \lambda_3 L_{Relation} \qquad \text{(Equation 11)}$$

wherein L_D denotes the comprehensive loss function; L_GT denotes the loss function of the target detection backbone network (for example, assuming the first target detection model is a target detection model based on a YOLO-series algorithm, L_GT is the loss function of that YOLO-series algorithm); λ_1, λ_2, and λ_3 denote the weights of L_Response, L_Feature, and L_Relation, respectively, and are used to balance the contributions of response-based knowledge, feature-based knowledge, and relationship-based knowledge.
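The combined objective can be sketched in one line; the λ weights below are illustrative, not values given in the patent:

```python
def total_distillation_loss(l_gt, l_response, l_feature, l_relation,
                            lambdas=(1.0, 0.5, 0.5)):
    """Backbone detection loss plus the three weighted distillation terms."""
    l1, l2, l3 = lambdas
    return l_gt + l1 * l_response + l2 * l_feature + l3 * l_relation

total = total_distillation_loss(1.0, 2.0, 4.0, 6.0)  # 1 + 2 + 2 + 3 = 8
```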
In an exemplary embodiment, as shown in fig. 6, another object detection model training method is provided, and the method is applied to a server for illustration, and includes the following steps:
step S601, acquiring a first object detection model obtained according to a first sample image in a first scene.
Step S602, a second initial sample image in a second scene is acquired.
Step S603, determining a prediction label of each second initial sample image based on the first target detection model.
In step S604, a second initial sample image with a confidence level greater than or equal to a preset confidence level threshold value corresponding to the prediction label is identified from the second initial sample images, and the second initial sample image is used as a second target sample image carrying the prediction label.
Step S605, training the first target detection model based on the second target sample image carrying the prediction tag, to obtain a trained first target detection model.
Step S606, taking the second initial sample image with the confidence coefficient of the corresponding predictive label smaller than the preset confidence coefficient threshold value in each second initial sample image as the second initial sample image, taking the trained first target detection model as the first target detection model, and returning to the step of determining the predictive label of each second initial sample image based on the first target detection model until the trained first target detection model meets the preset training condition.
Step S607, determining the trained first target detection model satisfying the preset training condition as the target detection teacher model corresponding to the second scene.
Step S608, based on the target detection teacher model, performing response-based knowledge distillation on the target detection student model.
Step S609, based on the target detection teacher model, feature-based knowledge distillation is performed on the target detection student model.
In step S610, based on the target detection teacher model, a relationship-based knowledge distillation is performed on the target detection student model.
In step S611, the object detection student model after knowledge distillation is used as a second object detection model in the second scenario.
In this embodiment, on the one hand, the server screens out, through a preset confidence threshold, the second target sample images with reliable prediction labels from the plurality of second initial sample images, so as to avoid the influence of unreliable prediction labels on target detection in the second scene and thereby ensure the accuracy of the second target detection model in the second scene; it also updates the training set of the first target detection model with the second target sample images carrying prediction labels, so that the first target detection model can perform target detection in the second scene, and the trained first target detection model can in turn generate prediction labels for the second initial sample images not yet carrying labels, i.e., label the sample images in the second scene, continuously improving the target detection capability of the first target detection model in the second scene. On the other hand, through response-based, feature-based, and relationship-based knowledge distillation of the target detection student model (covering the output layer, the intermediate layers, and the relationships among output results), the server makes the output of the target detection student model approximate that of the target detection teacher model as closely as possible, so that the student model itself approximates the teacher model, thereby compressing the target detection teacher model and improving the target detection speed in the second scene.
According to the target detection model training method based on the process, under the condition that the target detection effect is ensured, a second target detection model with lighter weight can be obtained, so that migration of the target detection model in different scenes is realized, and generalization capability of the target detection model in different scenes is improved.
In order to more clearly illustrate the target detection model training method provided by the embodiments of the present application, a specific embodiment is described below, but it should be understood that the embodiments of the present application are not limited thereto. As shown in fig. 7, in an exemplary embodiment, the present application further provides a method for constructing a semi-supervised knowledge distillation framework for general-purpose robot vision system optimization, which specifically includes the following steps:
step 1: and labeling the sample image of the new scene based on semi-supervised learning.
The method comprises the steps that firstly, a first target detection model in a first scene is determined by a server, and the first target detection model is obtained based on training of a first sample image obtained through manual label marking in the first scene; then, the server labels the second initial sample image which is not labeled by the artificial label in the second scene by using the first target detection model to obtain a second target sample image carrying the label, and trains the first target detection model by using the second target sample image carrying the label to obtain a target detection teacher model capable of realizing target detection in the second scene.
Step 2: and carrying out knowledge distillation on the target detection student model in the second scene based on the target detection teacher model.
Because the model volume of the target detection teacher model is large, the model structure is complex, so that the server needs to compress the target detection teacher model, and the specific model compression process is as follows: based on the target detection teacher model, the server performs response-based knowledge distillation, feature-based knowledge distillation and relationship-based knowledge distillation on the target detection student model respectively, and migrates the response-based knowledge, the feature-based knowledge and the relationship-based knowledge in the target detection teacher model into the target detection student model, so that the target detection student model can realize target detection in a second scene in a lighter structure than the target detection teacher model, and migration of the target detection model in different scenes is realized.
In this embodiment, on the one hand, the above method avoids the manual labeling of large amounts of data, making it easier to expand to large-scale data sets; on the other hand, it enables rapid migration of the target detection model across different scenes and improves the generalization capability of the target detection model. In addition, experiments show that, with the semi-supervised knowledge distillation framework for general-purpose robot vision system optimization, the detection time of the target detection model is reduced from 185 milliseconds to 45 milliseconds, while a recall rate exceeding 99.5% and an accuracy rate of 92.6% are achieved across different working environments, significantly improving the performance of the target detection model.
It should be understood that, although the steps in the flowcharts of the above embodiments are shown sequentially as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited in their order of execution and may be performed in other orders. Moreover, at least some of the steps in the above flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and their order is not necessarily sequential; they may be performed in turn or alternately with at least some of the other steps or stages.
Based on the same inventive concept, an embodiment of the application further provides a target detection model training apparatus for implementing the above target detection model training method. The solution implemented by the apparatus is similar to that described in the method above, so for the specific limitations in the apparatus embodiments below, reference may be made to the limitations of the target detection model training method above, which are not repeated here.
In an exemplary embodiment, as shown in fig. 8, there is provided an object detection model training apparatus, including: a first model training module 802, a sample image acquisition module 804, a sample label prediction module 806, a detection model migration module 808, and a second model training module 810, wherein:
a first model training module 802, configured to obtain a first target detection model obtained according to a first sample image in a first scene; the first sample image is a sample image carrying a label.
A sample image obtaining module 804, configured to obtain a second initial sample image in a second scene; the second scene is a different scene than the first scene; the second initial sample image is a sample image that does not carry a label.
The sample label prediction module 806 is configured to perform label prediction processing on the second initial sample image based on the first target detection model, so as to obtain a second target sample image carrying a predicted label.
And the detection model migration module 808 is configured to obtain a target detection teacher model corresponding to the second scene according to the second target sample image carrying the prediction tag and the first target detection model.
And the second model training module 810 is configured to perform knowledge distillation on the target detection student model corresponding to the second scene based on the target detection teacher model, to obtain a second target detection model in the second scene.
In an exemplary embodiment, the sample label prediction module 806 is further configured to determine a predicted label for each second initial sample image based on the first target detection model, and to determine, from the second initial sample images, those whose predicted labels have a confidence greater than or equal to a preset confidence threshold as second target sample images carrying predicted labels.
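The confidence filtering performed by this module can be sketched as follows, assuming predictions arrive as (image, label, confidence) triples; the triple format is an assumption made for illustration.

```python
def split_by_confidence(predictions, threshold):
    """Split (image, label, confidence) triples into accepted pseudo-labelled
    samples (confidence >= threshold) and still-unlabelled leftovers."""
    accepted = [(img, lab) for img, lab, conf in predictions if conf >= threshold]
    leftover = [img for img, lab, conf in predictions if conf < threshold]
    return accepted, leftover
```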
In an exemplary embodiment, the detection model migration module 808 is further configured to train the first target detection model based on the second target sample images carrying predicted labels to obtain a trained first target detection model; to take, as the new second initial sample images, those second initial sample images whose predicted labels have a confidence below the preset confidence threshold, take the trained first target detection model as the first target detection model, and return to the step of determining a predicted label for each second initial sample image based on the first target detection model, until the trained first target detection model meets a preset training condition; and to determine the trained first target detection model that meets the preset training condition as the target detection teacher model corresponding to the second scene.
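The iterative retraining loop described above can be sketched as follows. Here `predict`, `train`, and the stopping rule (no newly accepted samples, or an exhausted round budget) are hypothetical stand-ins for the patent's model calls and "preset training conditions".

```python
def self_training(model, unlabelled, predict, train, threshold, max_rounds=5):
    """Repeatedly pseudo-label, filter by confidence, retrain, and feed the
    low-confidence remainder back into the next round.

    predict(model, image) -> (label, confidence)   # assumed interface
    train(model, labelled_samples) -> new_model    # assumed interface
    """
    for _ in range(max_rounds):
        scored = [(img, *predict(model, img)) for img in unlabelled]
        accepted = [(img, lab) for img, lab, conf in scored if conf >= threshold]
        unlabelled = [img for img, lab, conf in scored if conf < threshold]
        if not accepted:        # no confident samples left: stop iterating
            break
        model = train(model, accepted)
    return model                # the target detection teacher model
```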
In an exemplary embodiment, the second model training module 810 is further configured to perform target detection on the second target sample image based on the target detection teacher model to obtain a teacher label of the second target sample image, and perform target detection on the second target sample image based on the target detection student model to obtain a student label of the second target sample image; training the target detection student model based on first label difference information between the teacher label and the student label of the second target sample image and second label difference information between the student label and the prediction label of the second target sample image to obtain a second target detection model in a second scene.
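The two label-difference terms above can be sketched as follows on per-sample label scores. The mean-squared-error form and the mixing weight `alpha` are assumptions for illustration; the patent does not fix the loss functions.

```python
def response_distillation_loss(teacher_labels, student_labels, pseudo_labels,
                               alpha=0.5):
    """Combine the first label difference (teacher vs. student) and the
    second label difference (student vs. predicted/pseudo label)."""
    n = len(student_labels)
    d1 = sum((t - s) ** 2 for t, s in zip(teacher_labels, student_labels)) / n
    d2 = sum((s, p) == (s, p) and (s - p) ** 2
             for s, p in zip(student_labels, pseudo_labels)) / n
    return alpha * d1 + (1 - alpha) * d2
```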
In an exemplary embodiment, the second model training module 810 is further configured to determine teacher output features of the second target sample image at an intermediate network layer of the target detection teacher model and student output features of the second target sample image at an intermediate network layer of the target detection student model; and to train the intermediate network layer of the target detection student model based on feature difference information between the teacher output features and the student output features, obtaining a second target detection model in the second scene.
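The feature difference information can be sketched as a per-layer mean squared error between intermediate features, assuming matching layer names and feature widths (in practice the student features would typically be projected to the teacher's width first); both assumptions are illustrative.

```python
def feature_distillation_loss(teacher_layers, student_layers):
    """Sum of per-layer MSEs between teacher and student intermediate
    features, keyed by layer name (e.g. {"stage3": [..features..]})."""
    total = 0.0
    for name, t_feat in teacher_layers.items():
        s_feat = student_layers[name]           # matching layer assumed
        total += sum((t - s) ** 2 for t, s in zip(t_feat, s_feat)) / len(t_feat)
    return total
```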
In an exemplary embodiment, the second model training module 810 is further configured to obtain a teacher label of each second target sample image based on the target detection teacher model, and obtain a student label of each second target sample image based on the target detection student model; extracting features of the relation among the teacher labels to obtain teacher label relation information, and extracting features of the relation among the student labels to obtain student label relation information; training the target detection student model based on relation difference information between teacher label relation information and student label relation information to obtain a second target detection model in a second scene.
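The relation difference information can be sketched with pairwise absolute differences between per-sample label scores as a toy stand-in for the "relation among labels" feature extraction; the patent does not specify the relation function, so this choice is an assumption.

```python
def relation_matrix(labels):
    """Pairwise absolute differences between per-sample label scores: a toy
    stand-in for extracting the relation among labels."""
    return [[abs(a - b) for b in labels] for a in labels]

def relation_distillation_loss(teacher_labels, student_labels):
    """MSE between the teacher's and the student's relation matrices."""
    tm = relation_matrix(teacher_labels)
    sm = relation_matrix(student_labels)
    n = len(teacher_labels)
    return sum((tm[i][j] - sm[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)
```

Note that this term compares structure rather than raw values: shifting every student score by the same constant leaves the relation matrix, and hence the loss, unchanged.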
Each module in the above target detection model training apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, a processor of the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can call and execute the operations corresponding to each module.
In an exemplary embodiment, a computer device is provided, which may be a server, and whose internal structure may be as shown in fig. 9. The computer device includes a processor, a memory, an input/output interface (I/O) and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for running the operating system and computer programs stored in the non-volatile storage medium. The database of the computer device is used to store sample image data for each scene. The input/output interface of the computer device is used to exchange information between the processor and external devices. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a target detection model training method.
It will be appreciated by persons skilled in the art that the architecture shown in fig. 9 is merely a block diagram of some of the architecture relevant to the present inventive arrangements and is not limiting as to the computer device to which the present inventive arrangements are applicable, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In an exemplary embodiment, a computer device is also provided, comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In an exemplary embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method embodiments described above.
In an exemplary embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
Those skilled in the art will appreciate that implementing all or part of the above methods may be accomplished by instructing relevant hardware through a computer program, which may be stored on a non-volatile computer-readable storage medium and which, when executed, may include the flows of the above method embodiments. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. The volatile memory may include Random Access Memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM is available in various forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided in the application may be general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, data processing logic units based on quantum computing, and the like, but are not limited thereto.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features have been described; however, as long as a combination of technical features involves no contradiction, it should be considered to fall within the scope of this specification.
The foregoing embodiments illustrate only a few implementations of the application and are described in relative detail, but they are not to be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and improvements without departing from the concept of the application, all of which fall within the protection scope of the application. Accordingly, the protection scope of the application shall be subject to the appended claims.

Claims (10)

1. A method for training a target detection model, the method comprising:
acquiring a first target detection model obtained according to a first sample image in a first scene; the first sample image is a sample image carrying a label;
acquiring a second initial sample image in a second scene; the second scene is a different scene than the first scene; the second initial sample image is a sample image without a label;
performing label prediction processing on the second initial sample image based on the first target detection model to obtain a second target sample image carrying a prediction label;
obtaining a target detection teacher model corresponding to the second scene according to the second target sample image carrying the prediction tag and the first target detection model;
and carrying out knowledge distillation on the target detection student model corresponding to the second scene based on the target detection teacher model to obtain a second target detection model in the second scene.
2. The method according to claim 1, wherein performing label prediction processing on the second initial sample image based on the first target detection model to obtain a second target sample image carrying a prediction label includes:
determining a predictive label of each second initial sample image based on the first target detection model;
and identifying the second initial sample image with the confidence coefficient of the corresponding predictive label being greater than or equal to a preset confidence coefficient threshold value from the second initial sample images as a second target sample image carrying the predictive label.
3. The method according to claim 2, wherein the obtaining the target detection teacher model corresponding to the second scene according to the second target sample image carrying the prediction tag and the first target detection model includes:
training the first target detection model based on the second target sample image carrying the prediction tag to obtain a trained first target detection model;
taking a second initial sample image with the confidence coefficient of the corresponding predictive label smaller than the preset confidence coefficient threshold value in each second initial sample image as the second initial sample image, taking the trained first target detection model as the first target detection model, and returning to the step of determining the predictive label of each second initial sample image based on the first target detection model until the trained first target detection model meets preset training conditions;
and determining the trained first target detection model meeting the preset training conditions as a target detection teacher model corresponding to the second scene.
4. The method of claim 1, wherein performing knowledge distillation on the object detection student model corresponding to the second scene based on the object detection teacher model to obtain a second object detection model in the second scene includes:
performing target detection on the second target sample image based on the target detection teacher model to obtain a teacher label of the second target sample image, and performing target detection on the second target sample image based on the target detection student model to obtain a student label of the second target sample image;
training the target detection student model based on first label difference information between a teacher label and a student label of the second target sample image and second label difference information between a student label and a prediction label of the second target sample image to obtain a second target detection model in the second scene.
5. The method of claim 1, wherein the performing knowledge distillation on the object detection student model corresponding to the second scene based on the object detection teacher model to obtain a second object detection model in the second scene further comprises:
determining teacher output characteristics of the second target sample image corresponding to an intermediate network layer of the target detection teacher model, and determining student output characteristics of the second target sample image corresponding to an intermediate network layer of the target detection student model;
and training the intermediate network layer of the target detection student model based on the characteristic difference information between the teacher output characteristics and the student output characteristics to obtain a second target detection model in the second scene.
6. The method according to any one of claims 1 to 5, wherein the performing knowledge distillation on the object detection student model corresponding to the second scene based on the object detection teacher model to obtain a second object detection model in the second scene further includes:
obtaining teacher labels of all second target sample images based on the target detection teacher model, and obtaining student labels of all second target sample images based on the target detection student model;
extracting features of the relation among the teacher labels to obtain teacher label relation information, and extracting features of the relation among the student labels to obtain student label relation information;
and training the target detection student model based on the relation difference information between the teacher label relation information and the student label relation information to obtain a second target detection model in the second scene.
7. An object detection model training apparatus, the apparatus comprising:
the first model training module is used for acquiring a first target detection model obtained according to a first sample image in a first scene; the first sample image is a sample image carrying a label;
the sample image acquisition module is used for acquiring a second initial sample image in a second scene; the second scene is a different scene than the first scene; the second initial sample image is a sample image without a label;
the sample label prediction module is used for carrying out label prediction processing on the second initial sample image based on the first target detection model to obtain a second target sample image carrying a prediction label;
the detection model migration module is used for obtaining a target detection teacher model corresponding to the second scene according to the second target sample image carrying the prediction tag and the first target detection model;
and the second model training module is used for carrying out knowledge distillation on the target detection student model corresponding to the second scene based on the target detection teacher model to obtain a second target detection model under the second scene.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 6 when the computer program is executed.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
CN202310972379.1A 2023-08-02 2023-08-02 Object detection model training method, device, computer equipment and storage medium Pending CN117036855A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310972379.1A CN117036855A (en) 2023-08-02 2023-08-02 Object detection model training method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310972379.1A CN117036855A (en) 2023-08-02 2023-08-02 Object detection model training method, device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117036855A true CN117036855A (en) 2023-11-10

Family

ID=88640606

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310972379.1A Pending CN117036855A (en) 2023-08-02 2023-08-02 Object detection model training method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117036855A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117853490A (en) * 2024-03-06 2024-04-09 阿里巴巴达摩院(杭州)科技有限公司 Image processing method and training method of image processing model
CN117853490B (en) * 2024-03-06 2024-05-24 阿里巴巴达摩院(杭州)科技有限公司 Image processing method and training method of image processing model


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination