WO2022083157A1

WO2022083157A1 - Target detection method and apparatus, and electronic device

Info

Publication number: WO2022083157A1
Application number: PCT/CN2021/101773
Authority: WO
Inventors: 王剑锋
Original assignee: 北京迈格威科技有限公司
Priority date: 2020-10-22
Filing date: 2021-06-23
Publication date: 2022-04-28
Also published as: CN112418268A

Abstract

A target detection method and apparatus, and an electronic device. The method comprises: acquiring an image to undergo detection (S202); and inputting the image into a target detection model, and obtaining a target detection result (S204), the target detection result comprising the position and a score of a bounding box corresponding to a target. A process of training the above target detection model comprises: inputting an image sample in an image sample set into a student network model, and obtaining a student model detection result corresponding to each pixel in a first feature map of the image sample; acquiring, by means of a teacher network model, a teacher model detection result corresponding to the image sample; determining tag allocation information of the image sample according to the teacher model detection result, and calculating a loss function value of the student network model according to the tag allocation information and the student model detection result; and adjusting parameters of the student network model on the basis of the loss function value, and continuing to perform training until a target detection model is obtained. The method improves performance and training efficiency of a target detection model.

Description

Target detection method, device and electronic device

CROSS-REFERENCE TO RELATED APPLICATIONS

This disclosure claims priority to the Chinese Patent Application No. 2020111434527 and entitled "Object Detection Method, Apparatus and Electronic Equipment" filed with the Chinese Patent Office on October 22, 2020, the entire contents of which are incorporated by reference in this disclosure .

technical field

The present disclosure relates to the technical field of model training, and in particular, to a target detection method, device and electronic device.

Background technique

Object detection is a basic task of computer vision. It finds objects of interest to users in a picture and outputs their categories and positions, which can be represented by bounding boxes. At present, the common target detection methods are all implemented based on neural networks. Each position on the feature map output by the neural network corresponds to an output result. Therefore, these methods include a process called label assignment in the training process. The process determines the learning target for each location on the neural network's feature map during training. In other words, during the training process, there are n targets (objects) on a training sample (a picture), and the label assignment process specifies whether each position on the feature map of the neural network is to learn a positive sample (foreground) or a negative sample ( background), if it is learning a positive sample, select 1 from the n targets as a positive sample for this position. This label assignment process is usually based on manually designed rules. Since the manually designed rules are subject to a certain degree, the performance of the network model trained in this label assignment method is not good, which affects the reliability of target detection. .

public content

In view of this, the purpose of the present disclosure is to provide a target detection method, device and electronic device, which can improve at least one of the above problems.

An embodiment of the present disclosure provides a target detection method, which includes: acquiring an image to be detected; inputting the image to be detected into a target detection model to obtain a target detection result; the target detection result includes the position and score of a bounding box corresponding to the target; wherein , the target detection model is trained by the following methods: input the image samples in the image sample set into the student network model, and obtain the student model detection results corresponding to each pixel of the first feature map of the image samples; wherein, the image samples are marked with the target real value frame, the student model detection result includes the score of the first reference position corresponding to each pixel of the first feature map and the coordinate information corresponding to the first reference position; obtain the teacher model detection result of the image sample by the teacher network model; wherein , the teacher network model is a pre-trained model, and the detection result of the teacher model includes the score of the second reference position corresponding to each pixel of the second feature map of the image sample and the coordinate information corresponding to the second reference position; The reference position numbers and/or position points of the first feature map and the second feature map are the same; the label assignment information of the image sample is determined according to the detection result of the teacher model; the loss function value of the student network model is calculated according to the label assignment information and the detection result of the student model ; Adjust the parameters of the student network model based on the loss function value and continue training until the trained student network model is obtained; take the trained student network model as the target detection model.

Optionally, the step of determining the label assignment information according to the detection result of the teacher model includes: for each second reference position, respectively calculating the overlap ratio of the second reference position and each target ground truth frame of the image sample to obtain a matrix. IoU:

Among them, i takes the value [1, N], j takes the value [1, A], N is the number of marked ground truth boxes, and A is the number of second reference positions included in the second feature map; based on the second reference The overlap ratio between the position and each target ground-truth frame and the score of the second reference position determine the prediction quality of the second reference position for each target ground-truth frame corresponding to the target; wherein, the prediction quality is used to characterize the second reference position What is detected is the probability of the target corresponding to the target ground truth frame; the label assignment information of each first reference position is determined based on the prediction quality of each second reference position for the target corresponding to each target ground truth frame.

The above-mentioned based on the overlap ratio of the second prediction frame corresponding to the second reference position and each target ground truth frame and the score of the second reference position, determine the prediction quality of each second reference position for the target corresponding to each target ground truth frame steps, including:

Use the formula q _ij =(s _j ) ^1-α *(IoU _ij ) ^α to calculate the prediction quality of each second reference position for the target corresponding to each target ground-truth frame, and obtain the prediction quality matrix Q; where q _ij takes the value is [0,1], α is a preset hyperparameter with a value in the [0,1] interval, s _j is the score of the jth second reference position, and IoU _ij is the jth second reference position corresponding to the The overlap ratio of the second prediction frame and the i-th target ground-truth frame is the element of the i-th row and the j-th column in the matrix IoU;

Optionally, the above-mentioned image samples are also marked with the target type corresponding to each target ground truth frame; based on the overlap ratio of the second prediction frame corresponding to the second reference position and each target ground truth frame and the score of the second reference position, The step of determining the prediction quality of each second reference position for the target corresponding to each target ground-truth frame includes: using the formula q _ij =(s _ij ) ^1-α *(IoU _ij ) ^α to calculate the Each target ground-truth box corresponds to the prediction quality of the target, and the prediction quality matrix Q is obtained; among them, q _ij is [0,1], α is a preset hyperparameter with a value in the [0,1] interval, s _ij is the score corresponding to the current target type in the score of the jth second reference position, the current target type refers to the target type corresponding to the ith target ground truth box, and IoU _ij is the second reference position corresponding to the jth second reference position. The overlap ratio between the prediction frame and the i-th target ground-truth frame is the element of the i-th row and the j-th column in the matrix IoU;

Optionally, the above-mentioned step of determining the label assignment information of each first reference position based on the prediction quality of the target corresponding to each target truth frame based on each second reference position includes: for each second reference position, The second reference position selects the maximum prediction quality from the prediction quality of the target corresponding to each target ground-truth frame; judges whether the maximum prediction quality is greater than or equal to the first preset quality value; if so, assigns the maximum prediction quality to the second reference position The positive label of the corresponding target.

Optionally, the above-mentioned step of determining the label assignment information of the first reference position based on the prediction quality of the target corresponding to each target ground truth frame based on each second reference position includes: for the jth column in the prediction quality matrix, all from Select the element q _mj with the largest value in the column elements;

If q _mj is greater than t _p , set the element X _mj in the label assignment matrix corresponding to q _mj equal to the first value; for the elements q _ij in the jth column except q _mj , if q _ij is less than t _n , set q _ij corresponding to The element X _ij in the label assignment matrix of is equal to the second value; if q _ij is less than or equal to t _p and greater than or equal to t _n , set the corresponding X _{ij of q ij} _equal to the third value;

If q _mj is less than t _n , set the element X _ij in the j-th column of the label assignment matrix equal to the second value;

If q _mj is less than or equal to t _p and greater than or equal to t _n , set the element X _mj in the label assignment matrix corresponding to q _mj equal to the third value; for the elements q _ij in the jth column other than q _mj , if q _ij less than t _n , set the element X _ij in the label assignment matrix corresponding to q _ij equal to the second value; if q _ij is less than or equal to t _p and greater than or equal to t _n , set the element X _{ij in the label assignment matrix corresponding to q ij} _equal to third value;

Wherein, t _p >t _n , t _p and t _n are preset thresholds respectively, the first value represents positive samples, the second value represents negative samples, and the third value represents ignore samples.

The above-mentioned step of determining the label assignment information of each first reference position based on the prediction quality of the corresponding target of each target ground truth frame based on each second reference position includes: for the i-th row in the prediction quality matrix, all of the steps are obtained from this row. Select the target element q _im greater than t _p from the elements, and set the element X _im corresponding to the target element in the i-th row of the initial label assignment matrix as the first value; among them, q _im is greater than the unselected element in the row other elements q _iu ;

For elements q _iu other than q _im in the i-th row, if q _iu is less than or equal to t _p and greater than or equal to t _n , set the element X _{iu of the initial matrix corresponding to the label assignment of q iu} _equal to the third value; if q _iu is less than t _n , set the element X _{iu of the initial matrix corresponding to the label assignment of qi iu} _equal to the second value;

Check whether the elements in the jth column of the initial label assignment matrix have conflicting elements; among them, there are more than 2 conflicting elements, and the elements are all the first value; if there are conflicting elements, obtain the prediction corresponding to the conflicting elements in the prediction quality matrix Quality, retain the element with the highest predicted quality in the conflicting elements as the first value, and modify the remaining elements to the third value to obtain a label assignment matrix; where t _p >t _n , t _p and t _n are preset thresholds, respectively, the first value represents positive samples, the second value represents negative samples, and the third value represents ignore samples.

Optionally, the above step of calculating the loss function value of the student network model according to the label assignment information and the student model detection result includes: performing the following steps for each first reference position corresponding to each pixel in the first feature map. : determine the second reference position corresponding to the first reference position; determine the target truth value frame of the first reference position based on the label assignment information of the first reference position corresponding to the first reference position; based on the target truth value frame of the first reference position Calculate the classification loss function value and the regression loss function value with the score of the first reference position; determine the loss function value of the student network model based on the classification loss function value and the regression loss function value of each first reference position.

The embodiment of the present disclosure also provides a target detection device, wherein the device includes: an image acquisition module, configured to acquire an image to be detected; a target detection module, configured to input the to-be-detected image into a target detection model to obtain a target detection result; The detection result includes the position and score of the bounding box corresponding to the target; wherein, the target detection model is trained in the following way: input the image samples in the image sample set into the student network model, and obtain each pixel corresponding to the first feature map of the image sample. The student model detection results of The teacher model detection result of the sample; wherein, the teacher network model is a pre-trained model, and the teacher model detection result includes the score and position coordinates of the second reference position corresponding to each pixel of the second feature map of the image sample; wherein , the reference positions of the first feature map and the second feature map are the same; the label assignment information of the image sample is determined according to the detection result of the teacher model; the loss function value of the student network model is calculated according to the label assignment information and the detection result of the student model; based on the loss function Adjust the parameters of the student network model and continue training until the trained student network model is obtained; the trained student network model is used as the target detection model.

Optionally, the target detection module is further configured to: for each second reference position, calculate the intersection of the second prediction frame corresponding to the second reference position and each target ground truth frame of the image sample respectively. Overlap ratio, get matrix IoU:

Wherein, i takes the value [1, N], j takes the value [1, A], N is the number of the labeled ground truth boxes, and A is the number of the second reference positions included in the second feature map; Based on the overlap ratio of the second reference position and each of the target ground-truth boxes and the score of the second reference position, determining that the second reference position corresponds to a target for each of the target ground-truth boxes The prediction quality of The prediction quality of the target corresponding to the frame determines the label assignment information for each of the first reference positions. Use the formula q _ij =(s _j ) ^1-α *(IoU _ij ) ^α to calculate the prediction quality of each of the second reference positions for the target corresponding to each of the target ground-truth boxes, to obtain a prediction quality matrix Q; wherein, q _ij is [0,1], α is a preset hyperparameter with a value in the [0,1] interval, s _j is the score of the jth second reference position, and IoU _ij is the jth The overlap ratio of the second prediction frame corresponding to the two reference positions and the i-th target truth frame is the element of the i-th row and the j-th column in the matrix IoU;

Optionally, the target detection module is further configured to: use the formula q _ij =(s _j ) ^1-α *(IoU _ij ) ^α to calculate that each of the second reference positions corresponds to each of the target ground-truth boxes The prediction quality of the target is obtained, and the prediction quality matrix Q is obtained; wherein, q _ij is [0,1], α is a preset hyperparameter with a value in the [0,1] interval, and s _j is the jth The score of the two reference positions, IoU _ij is the overlap ratio of the second prediction frame corresponding to the jth second reference position and the i-th target true value frame, and is the element of the i-th row and the j-th column in the matrix IoU;

Optionally, the image sample is also marked with the target type corresponding to each target ground-truth frame; the target detection module is further configured to: use the formula q _ij =(s _ij ) ^1-α *(IoU _ij ) ^α to calculate For the prediction quality of the target corresponding to each of the second reference positions, the prediction quality matrix Q is obtained; wherein, q _ij takes a value of [0, 1], and α is a value between [0, 1] ] The preset hyperparameter of the interval, s _ij is the score corresponding to the current target type in the score of the jth second reference position, and the current target type refers to the target type corresponding to the ith target truth value frame, IoU _ij is the overlap ratio of the second prediction frame corresponding to the j-th second reference position and the i-th target ground-truth frame, and is the element of the i-th row and the j-th column in the matrix IoU;

Optionally, the target detection module is further configured to: for each of the second reference positions, select the maximum prediction quality from the prediction qualities of the target corresponding to each of the target ground truth frames at the second reference position; Determine whether the maximum prediction quality is greater than or equal to a first preset quality value; if so, assign a positive sample label of the target corresponding to the maximum prediction quality to the second reference position.

Optionally, the target detection module is further configured to: for the jth column in the prediction quality matrix, select the element q _mj with the largest value from the elements of the column; if q _mj is greater than t _p , set q _mj The element X _mj in the corresponding label assignment matrix is equal to the first value; for the elements q _ij other than q _mj in the jth column, if q _ij is less than t _n , set the element X _{ij in the label assignment matrix corresponding to q ij} _equal to The second value; if q _ij is less than or equal to t _p and greater than or equal to t _n , set X _ij corresponding to q _ij equal to the third value; if q _mj is less than t _n , set X ij in the jth column of the label assignment matrix The element X _ij is equal to the second value; if q _mj is less than or equal to t _p and greater than or equal to t _n , set the element X _mj in the label assignment matrix corresponding to q _mj to be equal to the third value; for the jth column in For elements q _ij other than q _mj , if q _ij is less than t _n , set the element X _ij in the label assignment matrix corresponding to q _ij equal to the second value; if q _ij is less than or equal to t _p and greater than or equal to t _n , set the element X _ij in the label assignment matrix corresponding to q _ij equal to the third value; wherein, t _p >t _n , t _p and t _n are preset thresholds respectively, the first value represents a positive sample, and the The second value represents negative samples and the third value represents ignore samples.

Optionally, the target detection module is further configured to: for the ith row in the prediction quality matrix, select a target element q _im greater than t _p from the elements of the row, and set the ith row in the label assignment initial matrix. The element X _im corresponding to the target element in the row is the first value; wherein, q _im is greater than other elements q _iu that are not selected in the row element; for the element q iu other than q _im in the i-th row _, If the qi _iu is less than or equal to t _p and greater than or equal to t _n , set the element X _iu of the initial matrix corresponding to the label assignment of the qi _iu equal to the third value; if the qi _iu is less than t _n , set the q iu _iu corresponds to the element X _iu of the initial label allocation matrix equal to the second value; check whether the elements in the jth column of the initial label allocation matrix have conflicting elements; wherein, the conflicting elements are more than 2, and the elements are all the first value; if there is a conflicting element, obtain the prediction quality corresponding to the conflicting element in the prediction quality matrix, keep the element with the highest prediction quality in the conflicting element as the first value, and modify the remaining elements to be the The third value is to obtain a label assignment matrix; wherein, t _p >t _n , t _p and t _n are preset thresholds respectively, the first value represents positive samples, the second value represents negative samples, and the third value represents negative samples. A value to ignore samples.

Optionally, the target detection module is further configured to: for each first reference position corresponding to each pixel in the first feature map, perform the following steps: assign information based on the label of the first reference position. , determine the target ground truth frame of the first reference position; calculate the classification loss function value and the regression loss function value based on the target ground value frame of the first reference position and the score of the first reference position; The classification loss function value and the regression loss function value of the first reference position determine the loss function value of the student network model.

Embodiments of the present disclosure also provide an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the above method when executing the computer program.

Embodiments of the present disclosure further provide a computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and the computer program executes the steps of the foregoing method when the computer program is run by a processor.

Other features and advantages of the present disclosure will be set forth in the description that follows, and, in part, will be apparent from the description, or will be learned by practice of the present disclosure. The objectives and other advantages of the disclosure will be realized and attained by the structure particularly pointed out in the description and drawings.

In order to make the above-mentioned objects, features and advantages of the present disclosure more obvious and easy to understand, the preferred embodiments are exemplified below, and are described in detail as follows in conjunction with the accompanying drawings.

Description of drawings

In order to illustrate the specific embodiments of the present disclosure or the technical solutions in the prior art more clearly, the following briefly introduces the accompanying drawings that need to be used in the description of the specific embodiments or the prior art. Obviously, the accompanying drawings in the following description The drawings are some embodiments of the present disclosure. For those skilled in the art, other drawings can also be obtained from these drawings without creative efforts.

FIG. 1 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure;

2 is a flowchart of a target detection method according to an embodiment of the present disclosure;

3 is a flowchart of a training method for a target detection model provided by an embodiment of the present disclosure;

4 is a schematic diagram of a reference position in an anchor frame-based technology provided by an embodiment of the present disclosure;

5 is a schematic diagram of a reference position based on a non-anchor frame technology provided by an embodiment of the present disclosure;

6 is a flowchart of another method for training a target detection model according to an embodiment of the present disclosure;

7 is a flowchart of another method for training a target detection model provided by an embodiment of the present disclosure;

8 is a flowchart of another method for training a target detection model according to an embodiment of the present disclosure;

FIG. 9 is a flow chart of training a target detection model according to an embodiment of the present disclosure;

FIG. 10 is a schematic structural diagram of a target detection apparatus according to an embodiment of the present disclosure.

Detailed ways

In order to make the purposes, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the present disclosure will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are part of the embodiments of the present disclosure, but not all of them. example. Based on the embodiments in the present disclosure, all other embodiments obtained by those skilled in the art without creative efforts shall fall within the protection scope of the present disclosure.

Knowledge distillation refers to the method of using one (possibly deeper or more complex) neural network to guide another (possibly shallower or simpler) neural network during training. The former is called the teacher network model and the latter is called the student network model.

In the process of realizing the present disclosure, the inventor found through research that if a certain position on the feature map of the teacher network model has a better detection result for a certain target, then the corresponding position of the student network model has a better detection result for the target. The probability will be better, so it is more reasonable to assign the label of the target to this position and then train the student network model. The student network model obtained by this training method has higher reliability for target detection. Based on this, the embodiments of the present disclosure provide a target detection method, device, and electronic device. In this technology, a trained teacher network model is introduced to predict the training sample of the student network model, and then the label assignment information of the sample is determined. , based on this information to complete the training of the student network model to improve the performance of the student network model, thereby improving the reliability of target detection using the student network model. The following description will be given by way of examples.

Embodiments of the present disclosure first provide an exemplary illustration of an electronic device that can implement a target detection method and apparatus. As shown in FIG. 1 , a schematic structural diagram of an electronic device, the electronic device 100 includes one or more processors 102, One or more memories 104, input devices 106, output devices 108, and one or more image capture devices 110 are interconnected by a bus system 112 and/or other form of connection mechanism (not shown). It should be noted that the components and structures of the electronic device 100 shown in FIG. 1 are only exemplary and not restrictive, and the electronic device may also have other components and structures as required.

The processor 102 can be a server, an intelligent terminal, or a device that includes a central processing unit (CPU) or other forms of processing units with data processing capabilities and/or instruction execution capabilities, and can process data from other components in the electronic device 100. Processing may also control other components in electronic device 100 to perform object detection functions.

Memory 104 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory may include, for example, random access memory (RAM) and/or cache memory, among others. Non-volatile memory may include, for example, read only memory (ROM), hard disk, flash memory, and the like. One or more computer program instructions may be stored on a computer-readable storage medium, and the processor 102 may execute the program instructions to implement the functions described below (implemented by the processing device) in the disclosed embodiments and/or other desired functions. Various application programs and various data, such as various data used and/or generated by the application program, etc., may also be stored in the computer-readable storage medium.

Input device 106 may be a device used by a user to input instructions, and may include one or more of a keyboard, mouse, microphone, touch screen, and the like.

The output device 108 may output various information (eg, images or sounds) to the outside (eg, a user), and may include one or more of a display, a speaker, and the like.

The image acquisition device 110 may acquire a training sample set and store the acquired training sample set in the memory 104 for use by other components.

Exemplarily, each device in the electronic device configured to implement the target detection method and device according to the embodiments of the present disclosure may be integrated or distributed, such as the processor 102 , the memory 104 , the input device 106 and the output device 108 . The image acquisition device 110 is set in a designated position where the sample can be acquired. When the various devices in the above electronic device are integrated and set, the electronic device can be implemented as a smart terminal such as a camera, a smart phone, a tablet computer, a computer, a vehicle-mounted terminal, and the like.

In practical applications, an electronic device configured to implement the target detection method and apparatus according to the embodiments of the present disclosure may include more or less components than the above-described exemplary electronic device, which is not limited herein.

This embodiment also provides a target detection method. Referring to the flowchart of a target detection method shown in FIG. 2 , the method mainly includes the following steps S202 to S204:

Step S202, acquiring an image to be detected;

In this embodiment, the image to be detected may be an image acquired by an image acquisition device such as a camera or a camera, and the image acquisition device may be installed in the waiting hall of a passenger station (such as a subway or high-speed rail) according to the detection needs to perform face detection. Image or human body image acquisition; image acquisition equipment can also be set at traffic intersections or on both sides of the road to collect vehicle images according to detection needs. The above image to be detected can also be obtained from a third-party device (such as a cloud server, etc.).

In addition to the image to be detected, the image to be detected may also be an image corresponding to other types of target objects such as animals, designated objects, etc., which is not limited in this embodiment of the present disclosure.

Step S204: Input the image to be detected into the target detection model to obtain a target detection result; the target detection result includes the position and score of the bounding box corresponding to the target.

The target detection model can be a detection model for a specific type of target, or a detection model for multiple different types of targets. After inputting the image to be detected into the target detection model, the target detection model performs target detection on the image to be detected. The detection image contains the target of the type that the detection model can detect. Through the target detection model, the bounding box and the corresponding score corresponding to each target belonging to the type can be obtained. The score corresponding to the bounding box indicates that the target corresponding to the bounding box belongs to this type. Confidence of the target.

Referring to the flowchart of the training method of the target detection model shown in FIG. 3 , the above target detection model is mainly obtained by training the following steps S302 to S312:

Step S302, input the image samples in the image sample set into the student network model, and obtain the student model detection result corresponding to each pixel of the first feature map of the image sample; The model detection result includes the score of the first reference position corresponding to each pixel of the first feature map and the coordinate information corresponding to the first reference position; wherein, the first reference position includes the first anchor frame or the first position point;

The reference position (including the above-mentioned first reference position and the second reference position mentioned later) mentioned in the embodiments of the present disclosure may be based on anchor-based technology (anchor-based), each anchor point (that is, the pixel on the feature map) point) corresponding anchor box. Among them, each anchor point corresponds to one or more anchor boxes, and each anchor box corresponds to a prediction box. In this manner, the coordinate information corresponding to the first reference position is the coordinate offset between the first anchor frame and the first prediction frame corresponding to the first anchor frame, that is, the position of the first prediction frame relative to the first anchor frame, The coordinates of the first prediction frame can be determined from the coordinates of the first anchor frame and the coordinate offset.

The reference position (including the above-mentioned first reference position and the second reference position mentioned later) mentioned in the embodiment of the present disclosure may be based on the anchor-free technology (anchor-free), and each pixel on the feature map corresponds to one or multiple locations. Among them, each position point corresponds to a prediction frame, and the position point can be regarded as a reference position. In this way, the coordinate information corresponding to the first reference position is the first prediction frame of the position point relative to the position point. The coordinate offset of , the coordinates of the first prediction frame can be determined from the coordinates of the position point and the coordinate offset.

In order to further understand the above reference position, refer to the schematic diagram of the reference position based on the anchor frame technology shown in FIG. 4 and the schematic diagram of the reference position based on the non-anchor frame technology shown in FIG. The small line box represents the anchor box, the small dot in the middle of the anchor box represents the anchor point, and the dotted box represents the prediction box corresponding to the anchor box. The coordinate information corresponding to the anchor frame is the relative position of the dotted frame and the anchor frame. The arrow in FIG. 4 indicates the relative positional relationship between the anchor frame and the prediction frame. The dot in the middle of the dashed box in Figure 5 represents a position point, which can be regarded as a reference position in Figure 4, and the dotted box represents the prediction frame corresponding to the position point, then the coordinate information corresponding to the first reference position is The position of the dotted box in FIG. 5 is relative to the position point in FIG. 5 , and the arrow in FIG. 5 indicates the relative positional relationship between the position point and the prediction frame.

The teacher network model and the student network model in the embodiments of the present disclosure may both be based on the anchor frame technology, or one may be based on the anchor frame technology and the other may be based on the non-anchor frame technology, or both may be based on the non-anchor frame technology technology, as long as the number of reference positions (that is, the number of anchor boxes/the number of position points) is the same for both.

The coordinates of the anchor box, the prediction box and the bounding box can be represented by the coordinates of the upper left corner and the lower right corner of the box. For example, taking the anchor frame shown in Figure 4 as the first reference position as an example, for the convenience of description, the first reference position is called the first anchor frame. If the point in the upper left corner of the feature map is used as the coordinate origin, the horizontal direction is the positive direction of the abscissa axis (X axis), and the vertical downward direction is the positive direction of the ordinate axis (Y axis); the first anchor frame is expressed as: [a1(x _a1 ,y _a1 ),b1(x _b1 ,y _b1 )], where a1 represents the coordinates of the upper left corner of the first anchor frame, b1 represents the coordinates of the lower right corner of the first anchor frame, and the first prediction frame corresponding to the first anchor frame uses A1 and B1 Coordinate representation of two points, suppose Δx _a1 represents the abscissa offset between the upper left corner position point a1 and A1, Δy _a1 represents the ordinate offset between the upper left corner position point a1 and A1, Δx _b1 represents the upper right corner position point The abscissa offset between b1 and B1, Δy _b1 represents the ordinate offset between the upper right corner position point b1 and B1, then the coordinate information corresponding to the first anchor frame is expressed as [a1(Δx _a1 ,Δy _a1 ), b1(Δx _b1 ,Δy _b1 )], the coordinates of the first prediction frame corresponding to the first anchor frame can be determined from the coordinates of the first anchor frame itself and the coordinate information corresponding to the corresponding anchor frame: [A1(x _a1 - Δx _a1 , y _a1 -Δy _a1 ), B1(x _b1 +Δx _b1 ,y _b1 +Δy _b1 )].

The coordinate transformation method based on the non-anchor frame technology shown in FIG. 5 is similar to the above-mentioned FIG. 4, except that the first reference position is a position point, and the first prediction frame determined based on the position point uses two upper left corner and lower right corner. If the coordinates of the point are represented, the coordinate information corresponding to the first reference position is the coordinate offset between the upper left corner of the first prediction frame and the position point, and the coordinate offset between the lower right corner of the first prediction frame and the coordinate point.

The image sample set can be an image set obtained in advance from the network or other storage devices, or can be a sample set formed by manually labeling images collected by a collection device of an electronic device. The image sample set includes multiple images Samples, the specific number can be set according to demand.

Before the image samples in the image sample set are input into the student network model, the above-mentioned image sample set has already marked the ground-truth box of the target. The purpose of marking the ground-truth box of the target is to frame the target contained in the image sample, such as an image The samples include targets such as pedestrians, motor vehicles, non-motor vehicles, or faces. In this embodiment, faces, pedestrians, motor vehicles, and non-motor vehicles are marked with target ground-truth boxes in the form of bounding boxes. In actual use, in order to distinguish different types of targets in an image sample, target ground-truth boxes of different colors can be used to label different types of targets, or different category labels can be used to label them, such as 1 for the face frame and 3 for the face frame. For the motor vehicle frame, 5 represents a non-motor vehicle frame and the like, which are not limited here.

In addition to the detection of different types of targets, the above target detection model can also only detect targets of the same type. For example, the target detection model only detects one of the types of targets such as pedestrians, motor vehicles, non-motor vehicles or faces. , the target ground-truth box only frames the target of this type in the image sample.

Through the above-mentioned target ground-truth box, it can not only indicate which targets are included in the image sample and what type the target belongs to, but also obtain the position coordinates of the target in the image sample.

After the image sample is processed by the target detection of the student network model, the above-mentioned first feature map is output. The number of the first feature map is determined by the model design, and there can be multiple ones. Each first feature map can include C*H*W pixels, Among them, C is the number of feature map channels, H is the length of the feature map, and W is the width of the feature map. The reference position corresponding to each pixel in the first feature map (each pixel may correspond to one or more anchor boxes, or each pixel corresponds to one or more position points) is represented by the first reference position. The student network model can obtain the score corresponding to the first reference position corresponding to each pixel point of the first feature map and the coordinate information corresponding to the first reference position by performing target detection on the image sample.

If the student network model is trained to detect multiple types of targets, the score of a certain first reference position is multiple scores corresponding to multiple types, and the score corresponding to a certain type among the multiple scores represents the first benchmark The location-detected target belongs to the classification probability value of this type.

For example, the student network model is trained to detect 4 types of targets. The student network model is a network model based on anchor box technology. See the first feature map corresponding to each pixel in the student detection results shown in Table 1. Example of scores for an anchor box:

Table 1

Among them, the first anchor frame 1 to the first anchor frame 4 are the anchor frames corresponding to X pixels (X is less than or equal to 4) in the first feature map respectively. For each anchor frame, the corresponding face, human body, motor vehicle and The score of the license plate (that is, the classification probability score, or simply the score or the probability score) is shown in Table 1. According to the scores shown in Table 1, it can be seen that the target detected by the first anchor box 1 is more likely to belong to the human body type, while The target detected by the first anchor frame 2 is more likely to belong to the motor vehicle type, the target detected by the first anchor frame 3 is more likely to belong to the face type 1, and the target detected by the first anchor frame 4 belongs to the license plate. type is more likely.

Step S304, obtaining the teacher model detection result of the image sample by the teacher network model; wherein, the teacher network model is a pre-trained model, and the teacher model detection result includes the first point corresponding to each pixel of the second feature map of the above-mentioned image sample. The score of the second reference position and the coordinate information corresponding to the second reference position; wherein, the number of reference positions of the first feature map and the second feature map is the same, for example, both are C*H*W, wherein each feature map has H *W positions, the number of channels of the second feature map is C.

The above-mentioned second reference position includes a second anchor frame or a second position point. The score of the above-mentioned second reference position is also the classification probability value that the target detected in the second reference position output by the teacher network model for prediction belongs to each type. The larger the score, the more likely the target detected in the second reference position is. belong to this type.

The coordinate information corresponding to the second reference position is similar to the coordinate information corresponding to the first reference position, and the coordinate information corresponding to the second reference position is the coordinate offset between the second reference position and the second prediction frame corresponding to the second reference position , the coordinates of the second prediction frame corresponding to the second reference position can be determined based on the coordinates of the second reference position and the coordinate offset.

The teacher network model is a neural network model pre-trained by using the above-mentioned image sample set or other training image sample sets, wherein the teacher network model is used to predict the above-mentioned image samples and output a second feature map, and each pixel in the second feature map is used. The corresponding reference positions are all represented by second reference positions, and the second reference positions are in one-to-one correspondence with the first reference positions.

The teacher network model and the student network model may be network models for target detection based on anchor boxes. In this way, the number of anchor boxes of the first feature map obtained by the student network model and the second feature map obtained by the teacher network model are the same. The teacher network model and the student network model can also be network models that do not perform target detection based on anchor frames. In this way, the first feature map obtained by the student network model and the second feature map obtained by the teacher network model have the same number of location points. . Alternatively, one of the teacher network model and the student network model can be a network model for target detection based on anchor boxes, and the other is a network model for target detection not based on anchor boxes, then the number of anchor boxes of one and the position of the other Points are the same.

Step S306: Determine the label assignment information of the image sample according to the teacher model detection result.

Because the teacher network model is a pre-trained model, the score of the second reference position can reflect the probability that the target contained in the second reference position belongs to each target type. At the same time, because the image samples are marked with target ground-truth boxes, Based on the overlap ratio of the target ground-truth frame and the prediction frame corresponding to the second reference position, the possibility of which target the second reference position is specifically can be determined, and based on this information, the probability of the target corresponding to the second reference position can be determined. The target, that is, the label corresponding to the second reference position. The first reference position corresponds to the second reference position one-to-one, and the label corresponding to the second reference position is the label corresponding to the first reference position.

For example, taking the teacher network model as an anchor box-based neural network model as an example, there are two target ground truth boxes on the image sample, namely target ground truth box 1 corresponding to target 1 and target ground truth box 2 corresponding to target 2. There are 100 second anchor boxes in the second feature map, then according to the score and coordinate information of the second anchor box corresponding to each type of target, as well as the target truth box 1 and target truth box 2, it can be determined that target 1 and target 2 should be allocated Which of the 100 second anchor boxes, that is, which second anchor boxes have detected target 1 (positive samples of target 1) or target 2 (positive samples of target 2).

Because the number of reference positions corresponding to the first feature map and the second feature map is the same, and the second reference position corresponds to the first reference position one-to-one, the label assignment information of the first reference position can be based on the detection of the target at the second reference position. The situation is confirmed. If a target is detected at the second reference position corresponding to the first reference position, the label of the first reference position relative to the target is a positive sample. If no target is detected at the second reference position corresponding to the first reference position , then the label of the first reference position relative to the target is a negative sample. If you don’t care (because you don’t want to use the loss brought by this sample to carry out gradient backpropagation) or you are not sure whether the second reference position corresponding to the first reference position is detected A certain target, the label of the first reference position relative to the target is an ignored sample.

The above label assignment information is specifically used to represent the sample type of each target corresponding to the first reference position, and the sample type includes positive samples and negative samples. Alternatively, sample types include positive samples, negative samples, and ignore samples. The above positive samples can be represented by 1, negative samples can be represented by 0, and ignored samples can be represented by -1. Positive samples indicate that the first reference position should detect the target, negative samples indicate that the first reference position should not detect the target, and ignore samples indicate that they do not care or are uncertain whether the first reference position should detect the target. The resulting gradient is not backpropagated.

Assuming that the number of ground truth boxes on the image sample is N (that is, there are N targets), the total number of second reference positions on the second feature map is A (that is, there are A anchor boxes). For example, the image sample is a matrix of (3, H, W). After multiple convolutions of the teacher network model, M feature maps are generated, and each feature map can be represented as a matrix of (C, Hv, Wv), Among them, v is the feature map identifier, C is the number of channels corresponding to each feature map, when each position of the feature map corresponds to an anchor box, there are Hv×Wv anchor boxes on each feature map, A is all the second The total number of anchor boxes on the feature map, i.e.

After the above analysis, the result of the label assignment information of the image sample determined based on the detection result of the teacher model can be expressed as an N×A matrix, and each column has at most one 1, that is, each second anchor frame is assigned to at most one target. , or no target is assigned (becoming a negative sample or ignoring sample), and each row can have any number of 1s, that is, a target can be assigned to one or more second anchor boxes, or not assigned to any second anchor box.

The following can be represented by the label assignment information matrix X:

Among them, X _ij takes the value of 0, 1 or -1, where 1 is the value corresponding to the positive sample label, 0 is the value corresponding to the negative sample label, -1 is the value corresponding to the ignored sample label; the value of i is 1 A positive integer between 1 and A, and the value of j is a positive integer between 1 and A.

During specific implementation, the label assignment information corresponding to each position in the above matrix is obtained according to the detection result of the teacher network model.

Step S308: Calculate the loss function value of the student network model according to the label assignment information and the student model detection result.

The value of the loss function in the training process of the student network model in this embodiment not only depends on the detection result output by the student network model itself, but also on the label assignment information determined based on the detection result of the teacher network model. The label assignment can be performed more accurately, which can effectively alleviate the influence of the subjectivity of manual design of label assignment rules on the training effect of the student network model, make the calculation of the loss function value more accurate, and provide reliable data for the parameter adjustment of the student network model. .

Specifically, the loss function value of the student network model is a value calculated based on the loss function of the student network model, the score of the first reference position corresponding to each pixel in the first feature map, and the above label assignment information.

Step S310, adjust the parameters of the student network model based on the loss function value and continue training until the trained student network model is obtained;

Step S312, using the trained student network model as the target detection model.

In the above-mentioned target detection method provided by the embodiment of the present application, the training process of the target detection model configured to detect images is as follows: input the image samples in the image sample set into the student network model, and obtain each of the first feature maps corresponding to the image samples. The detection result of the student model corresponding to the pixel point; the detection result of the teacher model of the image sample by the teacher network model is obtained; wherein, the teacher network model is a pre-trained model, and the detection result of the teacher model includes the second feature corresponding to the image sample The score of the second reference position corresponding to each pixel in the figure and the coordinate information corresponding to the second reference position; determine the label assignment information of the image sample according to the teacher model detection result, and calculate the student network according to the label assignment information and the student model detection result The loss function value of the model, adjust the parameters of the student network model based on the loss function value and continue training until the trained student network model is obtained, and the trained student network model is used as the target detection model. The label assignment method of this training process is more efficient. Objectification and rationalization make the target detection model obtained by training more reliable, thereby improving the accuracy of target detection. Compared with the manually designed label allocation method, the label allocation method in this embodiment is more efficient and effectively alleviates the The influence of the subjectivity of artificially designed label assignment rules on the training effect of the student network model. This label assignment method can be adapted to both anchor box-based and non-anchor box-based networks, and is more universal than the label assignment method designed for a certain network. .

This embodiment also provides another method for training a target detection model, which is implemented on the basis of the above method, and focuses on the specific implementation of determining the label assignment information of the image samples. The student network model and the teacher network model are both An example of a network model based on anchor frame technology, as shown in Figure 6 is a flowchart of another target detection model training method, which illustrates the implementation of training the student network model, wherein the trained student network model is For the target detection model, specifically, it can be implemented with reference to the following steps S602 to S616:

Step S602, input the image samples in the image sample set into the student network model, and obtain the student model detection result corresponding to each pixel of the first feature map of the image sample; The model detection result includes the score of the first anchor frame corresponding to each pixel of the first feature map and the coordinate information corresponding to the first anchor frame;

Step S604, obtaining the teacher model detection result of the image sample by the teacher network model; wherein, the teacher network model is a pre-trained model, and the teacher model detection result includes the first pixel corresponding to each pixel point of the second feature map of the above-mentioned image sample. The score of the second anchor frame and the coordinate information corresponding to the second anchor frame;

The score corresponding to the second anchor frame can be obtained by using the teacher network model, wherein the score is the probability value of each second anchor frame corresponding to each target output by the teacher network model for target detection.

Step S606, for each second anchor frame, calculate the overlap ratio of the second prediction frame corresponding to the second anchor frame and each target truth frame of the image sample to obtain the matrix IoU;

Among them, i takes the value [1, N], j takes the value [1, A], N is the number of marked ground-truth boxes, and A is the number of second anchor boxes included in the second feature map.

The above overlap ratio is IoU (Intersection over Union), which represents the degree of overlap between the two frame areas. Usually, the value of the overlap ratio is [0, 1]. When the second prediction frame and the manually marked target ground truth frame When there is no overlap at all, its overlap ratio is 0. When the second prediction frame completely overlaps with the manually labeled target ground-truth frame, its overlap ratio is 1. In other cases, the overlap ratio is between 0 and 1. Arbitrary floating point number.

Step S608, based on the overlap ratio of the second prediction frame corresponding to the second anchor frame and each target ground-truth frame and the score of the second anchor frame, determine the prediction quality of the second anchor frame for the target corresponding to each target ground-truth frame ; wherein, the prediction quality is used to characterize the probability that the second anchor frame detects the target corresponding to the target ground truth frame.

For any second anchor frame in the second anchor frame obtained by the teacher network model, calculating the overlap ratio between the second prediction frame corresponding to the second anchor frame and each target ground-truth frame of the image sample, N× The overlap ratio matrix of 1, where the formula for calculating the overlap ratio is:

Among them, IoU _i is the overlap ratio between the second prediction frame corresponding to the second anchor frame and the i-th target ground-truth frame, i is a positive integer, and its value is [1, N], and N is the corresponding target ground-truth frame target number. The pred box represents the second anchor box, and the gt box _i represents the ith target ground-truth box.

For scenarios where the student network is trained to detect only one type of target, such as only faces, the image sample can contain one or more faces, and each face corresponds to a target ground-truth box. This scenario Next, the score s _j of the second anchor box is a numerical value. The prediction quality of each second anchor frame for the target corresponding to each target ground-truth frame can be calculated by using the formula q _ij =(s _j ) ^1-α *(IoU _ij ) ^α , and the prediction quality matrix Q is obtained; wherein, q _ij takes The value is [0,1], α is a preset hyperparameter with a value in the [0,1] interval, s _j is the score of the jth second anchor box (valued at [0,1]), IoU _ij is the overlap ratio of the second prediction frame corresponding to the j-th second anchor frame and the i-th target ground-truth frame (the value is [0,1]), which is the element of the i-th row and the j-th column in the matrix IoU;

For scenes where the student network is trained to detect multiple types of targets, such as detecting faces, human bodies, motor vehicles and non-motor vehicles, the image samples can contain any one of faces, human bodies, motor vehicles and non-motor vehicles, or One or more targets of multiple types, each target corresponds to a target ground-truth frame, in this scenario, the image sample is also marked with the target type corresponding to each target ground-truth frame, and the score s _ij of the second anchor frame is corresponding to A number of values corresponding to the target types that the student network can detect. The prediction quality of each second anchor frame for each target corresponding to the target ground-truth frame can be calculated by using the formula q _ij =(s _ij ) ^1-α *(IoU _ij ) ^α , and a prediction quality matrix Q is obtained; wherein, q _ij is [0,1], α is a preset hyperparameter with a value in the [0,1] interval, s _ij is the score corresponding to the current target type in the score of the jth second anchor box (value is [0,1]), the current target type refers to the target type corresponding to the i-th target ground-truth frame, and IoU _ij is the second prediction frame corresponding to the j-th second anchor frame and the i-th target ground-truth frame The overlap ratio of (value [0,1]) is the element of the i-th row and the j-th column in the matrix IoU;

Since the above prediction quality considers both the overlap ratio and the score s, that is, the confidence score of the second anchor frame corresponding to the target, which is objective and reasonable and does not depend on the anchor frame, etc., has good versatility, and is conducive to the determination of label assignment information .

In addition to using the overlap ratio method to obtain the prediction quality, the second anchor box score and coordinate information of the teacher network model can also be used to calculate the loss function with each target ground-truth box (and the method of calculating the loss function of the student network model). Consistent), obtain the prediction quality q=e^(-loss _m ) of the second anchor box corresponding to each pixel on the second feature map, where loss _m represents the loss function value of the mth position on the second feature map. The above only provides two methods for calculating the prediction quality. In this embodiment, the method for calculating the prediction quality is not limited.

Step S610: Determine label assignment information of the second anchor frame based on the prediction quality of the second anchor frame for the target corresponding to each target ground-truth frame.

Since the calculation of the second anchor frame and the target ground-truth frame can obtain an N×A overlap ratio matrix, the above formula for calculating the prediction quality can also be used to obtain an N×A prediction quality matrix. For each second anchor frame, The specific implementation process of determining the label assignment information based on the predicted quality can be implemented by steps A1-A3:

Step A1, select the maximum prediction quality from the second anchor frame for the prediction quality of the target corresponding to each target truth frame;

Step A2, judging whether the maximum predicted quality is greater than the first preset quality value;

Step A3, if yes, assign the positive sample label of the target corresponding to the maximum prediction quality to the second anchor frame, and for the labels of the second anchor frame corresponding to other targets, all may be negative labels, or some may be negative labels. for ignore tags.

The above-mentioned first preset quality value is set according to the actual situation, and is not limited here.

In actual use, negative labels and ignore labels can be assigned based on the following ways. When the maximum prediction quality is less than the first preset quality value, determine whether the maximum prediction quality is greater than the second preset quality value; wherein, the first prediction quality value is greater than the second prediction quality value; if so, assign ignore for the second anchor frame sample label; if no, assign a negative sample label to the second anchor box. The ignored sample labels and negative sample labels here are both for the target corresponding to the maximum prediction quality. If the target corresponding to the maximum prediction quality is a negative label, the second anchor frame is probably the background and other scene areas, and other targets are also is a negative label. If the target corresponding to the maximum prediction quality is an ignore label, the second anchor frame may be a negative label or an ignore label corresponding to other targets. The labels corresponding to other targets can also be based on the second anchor frame and other targets. The calculated prediction quality, the above-mentioned first prediction quality value and the second prediction quality value are determined.

The assigned sample labels may be represented by numerical values, letters or characters, which are not limited here.

Take the sample image containing 4 targets, among which target 1-target 4 in the 4 targets are face, motor vehicle, pedestrian and non-motor vehicle type as an example, it is assumed that the teacher network model is used to predict three image samples. The second anchor box, the three second anchor boxes and the prediction quality calculated by each target ground-truth box are expressed in the form of the following matrix:

Among them, the first row of the matrix represents the target 1, the target marked with the target ground-truth box, which corresponds to each second anchor box (for example: the second anchor box 1, the second anchor box 2, the second anchor box 3). Prediction quality, the second row of the matrix represents target 2, the target marked with the target ground-truth box, respectively corresponds to the prediction quality of the three second anchor boxes, and the third row of the matrix represents target 3, the marked target ground-truth frame. The targets correspond to the prediction quality of the three second anchor boxes respectively, and the fourth row of the matrix indicates that the target 4, the target marked with the target ground-truth box, corresponds to the prediction quality of the three second anchor boxes, respectively.

In this embodiment, the first preset quality value is set to 0.7, and the second preset quality value is 0.4. Since the maximum prediction quality of the first column of 0.8 is greater than the first preset quality value, it means that the second anchor frame 1 corresponds to The target of is target 1, and the positive sample label of target 1 and the negative sample label of the remaining targets can be assigned to the second anchor box 1; since the maximum prediction quality 0.3 of the second column is less than the second preset quality value, it can be used for the second anchor box. Box 2 assigns negative sample labels for all targets; since the maximum predicted quality 0.5 of the third column is between the first and second preset quality values, the second anchor box 3 can be assigned the ignore sample label for target 4 and negative sample labels for other targets.

Step S612, calculate the loss function value of the student network model according to the label assignment information and the student model detection result;

Step S614, adjust the parameters of the student network model based on the loss function value and continue training until the trained student network model is obtained;

Step S616, using the trained student network model as the target detection model.

The training method for the above target detection model provided by the embodiment of the present disclosure can calculate the overlap ratio according to the second prediction frame corresponding to the second anchor frame and each target ground-truth frame, and according to the second prediction frame corresponding to the second anchor frame and the The overlap ratio of each target ground-truth frame and the score of the second anchor frame determine the prediction quality of the second anchor frame for the target corresponding to each target ground-truth frame; and accurately obtain the label assignment information of the second anchor frame according to the predicted quality , according to the label assignment information of the second anchor frame, the label assignment is performed on the first feature map, which makes the label assignment objective and rational, which can effectively alleviate the influence of the subjectivity of manually designing label assignment rules on the training effect of the student network model, thereby improving the Performance of Student Network Models.

This embodiment also provides another method for training a target detection model, which is implemented on the basis of the above method, and focuses on the specific implementation of determining the label assignment information of an image sample in multiple target type application scenarios, so as to The student network model and the teacher network model are both network models based on the anchor frame technology. For example, as shown in FIG. 7, the flowchart of another target detection model training method mainly includes the following steps S702 to S716:

Step S702, input the image samples in the image sample set into the student network model, and obtain the student model detection result corresponding to each pixel of the first feature map of the image sample; The model detection result includes the score of the first anchor frame corresponding to each pixel of the first feature map and the coordinate information corresponding to the first anchor frame;

Step S704, obtaining the teacher model detection result of the image sample by the teacher network model; wherein, the teacher network model is a pre-trained model, and the teacher model detection result includes the first pixel corresponding to each pixel point of the second feature map of the above-mentioned image sample. The score of the second anchor frame and the coordinate information corresponding to the second anchor frame;

Step S706, calculating the overlap ratio between each target ground-truth frame of the image sample and the second prediction frame corresponding to each second anchor frame to obtain the matrix IoU:

Among them, i takes the value [1, N], j takes the value [1, A], N is the number of marked ground truth boxes, and A is the number of second anchor boxes included in the second feature map; in this embodiment, , there are N labeled ground truth boxes and A second anchor boxes of the image sample. By calculating the overlap ratio of each labeled ground truth box and the second prediction box corresponding to each second anchor box, the overlap can be obtained. than the matrix IoU.

Step S708, calculate the prediction quality of each second anchor frame for the target corresponding to each target truth frame, and obtain the prediction quality matrix Q;

Specifically, the formula q _ij =(s _ij ) ^1-α *(IoU _ij ) ^α can be used to calculate the prediction quality of each second anchor frame for the target corresponding to each target ground-truth frame to obtain the prediction quality matrix Q, and Q is the same is an N×A matrix. Among them, q _ij is [0, 1], which represents the prediction quality of the target corresponding to the j-th second anchor frame and the i-th target ground-truth frame; α is the preset value in the [0, 1] interval parameter, s _ij is the score corresponding to the current target type in the score of the j-th second anchor frame, the current target type refers to the target type corresponding to the i-th target ground-truth frame, and IoU _ij is the j-th second anchor frame The overlap ratio of the second prediction box corresponding to the anchor box and the i-th target ground-truth box is the element of the i-th row and the j-th column in the matrix IoU.

Step S710, convert the prediction quality matrix into the label assignment matrix X corresponding to the label assignment information:

Among them, X _ij is 0, 1 or -1, where 1 is the value corresponding to the positive sample label, 0 is the value corresponding to the negative sample label, and -1 is the value corresponding to the ignored sample label.

The specific process of converting the predicted quality matrix into the label assignment matrix can be converted from two different angles: row and column. The following is an example of converting the columns of the predicted quality matrix. The specific process of the conversion can be implemented through steps B1 to B4. :

Step B1, for the jth column in the prediction quality matrix, select the element q _mj with the largest value from the elements of this column;

Similarly, the prediction quality matrix obtained above is

The first preset quality value is 0.7 and the second preset quality value is 0.4 as an example for illustration, wherein the largest element in the first column is q ₃₁ , the largest element in the second column is q ₂₂ , and the The largest element is q ₄₃ .

Step B2, if q _mj is greater than t _p , set the element X _mj in the label assignment matrix corresponding to q _mj equal to the first value; for the elements q _ij in the jth column except q _mj , if q _ij is less than t _n , set X _ij corresponding to q _ij is equal to the second value; if q _ij is less than or equal to t _p and greater than or equal to t _n , set X _{ij corresponding to q ij} _equal to the third value;

Among them, the first value represents positive samples, the second value represents negative samples, and the third value represents ignore samples. In actual use, the value 1 can be used as the label of positive samples, the value 0 can be used as the label of negative samples, and the value -1 can be used as the label of negative samples. The label of the sample is ignored, or other characters are used as the label of the above-mentioned sample, which is not limited here.

Continuing from the previous example, take t _p as the first preset quality value of 0.7 and t _n as the second preset quality value of 0.4 as an example, wherein the largest element in the prediction quality of the first column in the above prediction quality matrix is that q ₃₁ =0.8 is greater than The first preset quality value is 0.7, then X _mj at the corresponding position of the label assignment matrix and q _mj is set to the first value, and for the other elements q ₁₁ , q ₂₁ and q ₄₁ in the first column except q ₃₁ The preset quality of all is less than the second preset quality value of 0.4, then X ₁₁ , X ₂₁ and X ₄₁ at the positions corresponding to q ₁₁ , q ₂₁ and q ₄₁ of the label assignment matrix can be set to the second value; and if q _ij is less than or equal to the first preset quality value of 0.7 and greater than or equal to the second preset quality value of 0.4, set X _{ij corresponding to q ij} _equal to the third value.

The process of determining the label assignment information at the corresponding position by comparing the predicted quality with the preset quality value in the second column and the third column is the same as the above, and will not be repeated here.

Step B3, if q _mj is less than t _n , set the element X _ij in the jth column of the label assignment matrix equal to the second value;

If the largest q _mj is smaller than t _n , it means that other prediction qualities in this column are smaller than t _n , for example, the maximum prediction quality of the second column of the above prediction quality matrix is 0.3, which is smaller than the second preset quality value of 0.4, Then the second column of the label assignment matrix obtained after conversion is the second value.

Step B4, if q _mj is less than or equal to t _p and greater than or equal to t _n , set X _mj in the label assignment matrix corresponding to q _mj to be equal to the third value; for the elements q _ij in the jth column other than q _mj , if If q _ij is less than t _n , set X _ij in the label assignment matrix corresponding to q _ij equal to the second value; if q _ij is less than or equal to t _p and greater than or equal to t _n , set X _{ij in the label assignment matrix corresponding to q ij} _equal to The third value; wherein, t _p >t _n , and t _p and t _n are preset thresholds respectively.

Continuing from the previous example, in the third column of the above prediction quality matrix, the maximum value is 0.5, between 0.7 and 0.4, then the value corresponding to this position after conversion is -1, and the values of the remaining elements in the third column are all less than 0.4, so Both are 0.

So for the above prediction quality matrix:

The label assignment matrix converted from the predicted quality matrix is:

There is at most one 1 in each column of the above assigned label assignment matrix, that is, the location of each second anchor box is assigned to at most one target (becomes a positive sample), or no target is assigned (becomes a negative sample or ignores the sample) .

The following takes the conversion of the rows of the prediction quality matrix as an example to illustrate that the specific process of conversion can be implemented through steps C1-step C2:

Step C1, for the i-th row in the prediction quality matrix, select the target element q _im greater than t _p from the elements in the row, and set the element X _im corresponding to the target element in the i-th row of the initial label assignment matrix as the A value; wherein, q _im is greater than the other elements q _iu that are not selected in this row of elements;

Positive sample labels for the same target can be assigned to one or more second anchor boxes, or none of them.

Step C2, for elements q _iu other than q _im in the i-th row, if the q _iu is less than or equal to t _p and greater than or equal to t _n , set the element X _{iu of the initial matrix corresponding to the label assignment of the q iu} _equal to The third value; if the qi _iu is less than t _n , set the element X _iu of the initial matrix corresponding to the label assignment of the qi _iu to be equal to the second value;

The above initial matrix of label assignment can be understood as an empty matrix, and each element X is assigned an assignment. After the above steps C1 and C2, element X will be assigned 0, 1 or -1.

Step C3, check whether the elements in the jth column of the initial label assignment matrix have conflicting elements; wherein, the conflicting elements are more than 2, and the elements are the first value; if there are conflicting elements, obtain the conflicting elements in the prediction quality matrix. For the corresponding prediction quality, keep the element with the highest prediction quality in the conflicting elements as the first value, and modify the remaining elements to be the third value to obtain a label assignment matrix;

By checking the conflicting elements, it can be guaranteed that each column of the final label assignment matrix has only one first value, that is, each position corresponds to only one positive label of the sample.

Wherein, t _p >t _n , and t _p and t _n are preset thresholds respectively.

The above-mentioned t _p and t _n are respectively the first preset quality value and the second preset quality value. In this embodiment, X _im corresponding to the selected target element q _im in each row is set to be between the first values. In addition, it is also necessary to compare other prediction quality with the preset threshold to obtain the corresponding sample type (positive sample, negative sample or ignored sample). The element X _iu corresponding to the element q _iu of the quality value is set as the third value; the X _iu corresponding to the q _iu smaller than the second preset quality value is set as the second value.

Step S712, calculate the loss function value of the student network model according to the label assignment information and the student model detection result;

Step S714, adjust the parameters of the student network model based on the loss function value and continue training until the trained student network model is obtained;

Step S716, using the trained student network model as the target detection model.

The training method for the above target detection model provided by the embodiment of the present disclosure can accurately obtain label assignment information by comparing the prediction quality with a preset threshold, and perform label assignment on the first anchor frame corresponding to the first feature map according to the label assignment information , which makes the label assignment objective and rational, which can effectively alleviate the influence of the subjectivity of manually designing label assignment rules on the training effect of the student network model, thereby improving the performance of the student network model.

This embodiment also provides another method for training a target detection model, which is implemented on the basis of the above method, and focuses on the specific implementation of calculating the loss function value of the student network model. The student network model and the teacher network model are both Taking the network model based on the anchor frame technology as an example, the flowchart of another target detection model training method shown in FIG. 8 mainly includes the following steps S802 to S820:

Step S802, input the image samples in the image sample set into the student network model, and obtain the student model detection result corresponding to each pixel of the first feature map of the image sample; The model detection result includes the score of the first anchor frame corresponding to each pixel of the first feature map and the coordinate information corresponding to the first anchor frame;

Step S804, obtaining the teacher model detection result of the image sample by the teacher network model; wherein, the teacher network model is a pre-trained model, and the teacher model detection result includes the first pixel corresponding to each pixel of the second feature map of the above-mentioned image sample. The score of the second anchor frame and the coordinate information corresponding to the second anchor frame;

Step S806, determining the label assignment information of the image sample according to the teacher model detection result;

Step S808, for each first anchor frame corresponding to each pixel in the first feature map, perform the following operations in steps S812 to S818;

Step S812, based on the label assignment information, determine the target ground truth frame of the first anchor frame;

The target ground-truth box corresponding to the first anchor box can be determined based on the label assignment information. Taking the first anchor frame 1 as an example, if its label assignment information is (0, 0, 1, 0), the target ground-truth frame of the target corresponding to the positive sample label of the second anchor frame 1 can be used as the first anchor frame The target ground-truth box.

Step S814, calculating the classification loss function value and the regression loss function value based on the target ground truth frame of the first anchor frame and the score of the first anchor frame;

The above classification loss function value can be obtained through the classification loss function, and the classification loss function can be a cross entropy function. For example, if there are only two types of target categories, it can be a binary cross entropy function (Binary Cross Entropy). If the target category is multiple categories, the multi-category cross entropy function (softmax_cross_entropy) can be used.

The above regression loss function value can be obtained by using the overlap ratio loss function (IoU Loss), therefore, the regression loss function value loss2=-log(IoU), IoU is the first prediction frame corresponding to the first anchor frame and the target truth value frame. Overlap ratio.

In actual use, the classification loss function and the regression loss function can be selected according to actual needs. Therefore, the corresponding classification loss function value and regression loss can be calculated according to the target ground truth frame of the first anchor frame and the score of the first anchor frame. The function value is not limited or described here.

Step S816, determining the loss function value of the student network model based on the classification loss function value and the regression loss function value of each first anchor frame;

Usually, the loss function value of the student network model is obtained by adding the calculated classification loss function value and the regression loss function value.

Step S818, adjust the parameters of the student network model based on the loss function value and continue training until the trained student network model is obtained;

The above step S818 can be realized by step D1-step D2:

Step D1, adjust the parameters of the student network model based on the loss function value to continue training;

Step D2, when the loss function value converges to a preset value or the number of training times reaches a preset number of times, the training is stopped, and a trained student network model is obtained.

Usually, when the loss function value is greater than the preset value, it means that the currently trained student network model has not reached the preset convergence degree, and the process from the above steps S802 to S816 can be repeated until the obtained loss function value converges to the preset value. Stop the training of the student network model.

Alternatively, the training of the student network model is stopped when the number of times of repeating steps S802 to S816 reaches a preset number of times. In actual use, the preset value and the preset number of times can be set according to the actual situation, which is not limited here.

Step S820, using the trained student network model as the target detection model.

The training method for the above target detection model provided by the embodiment of the present disclosure can input the image samples in the image sample set into the student network model, obtain the first feature map corresponding to the image sample, and obtain the teacher model detection result of the sample by the teacher network model. For each first anchor frame corresponding to each pixel in the first feature map, label assignment information can be determined based on the detection result of the teacher network, and the target truth value frame corresponding to the first anchor frame can be determined based on the label assignment information, and can be determined according to the first anchor frame. The target truth box of the anchor box and the score of the first anchor box determine the loss function value of the student network model; so that the loss function value generated by the student network model during the training process not only depends on the output of the student network model itself, but also based on Based on the detection result of the trained teacher network model, label assignment information can be determined based on the detection result, and then the loss function value of the student network model is calculated based on the label assignment information, so that the calculation of the loss function value is more accurate, which is beneficial for the student network. The parameter tuning of the model provides reliable data.

Further, in order to fully understand the training method of the above target detection model, Figure 9 shows a training flow chart of a target detection model. As shown in Figure 9, the student network model and the teacher network model are based on anchor frame technology. Taking the network model as an example, the leftmost picture 900 is an image sample marked with an artificial target ground-truth frame. When the image sample is input into the teacher network model 901, the score of the second anchor frame corresponding to each pixel in the second feature map 902 can be obtained scores2 and coordinate information to obtain the second prediction frame pred boxes2, based on the calculation of the second prediction frame pred boxes2 and the image sample target ground truth frame, the overlap ratio matrix IoU can be obtained, based on the overlap ratio matrix IoU and the second anchor box score scores2 The predicted quality matrix qualities of the second feature map 902 are obtained, and the label assignment information can be determined based on the predicted quality matrix qualities, and this process corresponds to the assignment in FIG. 9 ; when the image sample is input into the student network model 903, the first feature map 904 can be obtained The score scores1 of the first anchor frame corresponding to each pixel and the first prediction frame pred boxes1 corresponding to the first anchor frame are used to assign the target ground-truth box to the first feature map 904 by using the label assignment information. The value box and the first prediction box pred boxes1 and the score scores1 calculate the classification loss function value (classification loss) and the regression loss function value (regresssion loss), and finally calculate the loss function of the student network model through the classification loss function and the regression loss function value. The value (loss), based on which the loss function value (loss) is used to train the student network model.

In the training process of the target detection model, there is no need to manually label the first feature map, and the detection results obtained by the teacher network model are used to assign labels to the first feature map, so that the label assignment is objective and rational. The first feature map trains the student network model, optimizes the training process of the student network model, and can effectively alleviate the influence of the subjectivity of manually designing label assignment rules on the training effect of the student network model, thereby improving the performance of the student network model. The accuracy of the student network model for target detection, etc.

Corresponding to the foregoing method embodiments, an embodiment of the present disclosure provides a target detection apparatus. FIG. 10 shows a schematic structural diagram of a target detection apparatus. As shown in FIG. 10 , the apparatus includes:

an image acquisition module 1002, configured to acquire an image to be detected;

The target detection module 1004 is configured to input the image to be detected into the target detection model to obtain the target detection result; the target detection result includes the position and score of the bounding box corresponding to the target; wherein, the target detection model is trained by the following methods: The image sample is input to the student network model, and a student model detection result corresponding to each pixel of the first feature map of the image sample is obtained; wherein, the image sample is marked with a target ground-truth frame, and the student model detection result includes the first feature map. The score of the first reference position corresponding to each pixel point and the coordinate information corresponding to the first reference position; obtain the teacher model detection result of the image sample by the teacher network model; wherein, the teacher network model is a pre-trained model, and the teacher model detects The result includes the score of the second reference position corresponding to each pixel of the second feature map of the image sample and the coordinate information corresponding to the second reference position; wherein, the number of reference positions of the first feature map and the second feature map is the same; Determine the label assignment information of the image sample according to the detection result of the teacher model; calculate the loss function value of the student network model according to the label assignment information and the detection result of the student model; adjust the parameters of the student network model based on the loss function value and continue training until the trained students are obtained Network model; use the trained student network model as the target detection model.

In the above-mentioned target detection device provided by the embodiment of the present application, the training process of the target detection model configured to detect images is as follows: input the image samples in the image sample set into the student network model, and obtain each of the first feature maps corresponding to the image samples. The detection result of the student model corresponding to the pixel point; the detection result of the teacher model of the image sample by the teacher network model is obtained; wherein, the teacher network model is a pre-trained model, and the detection result of the teacher model includes the second feature corresponding to the image sample The score of the second reference position corresponding to each pixel in the figure and the coordinate information corresponding to the second reference position; the above-mentioned teacher model detection result is used to determine the label assignment information of the image sample, and the student network model is calculated according to the label assignment information and the student model detection result According to the loss function value, adjust the parameters of the student network model based on the loss function value and continue training until the trained student network model is obtained, and the trained student network model is used as the target detection model. The label assignment method of this training process is more objective. Compared with the manually designed label allocation method, the label allocation method in this embodiment is more efficient, and effectively alleviates the manual labor The subjectivity of designing label assignment rules affects the training effect of the student network model. This label assignment method can be adapted to anchor box-based and non-anchor box-based networks, and is more universal than the label assignment method designed for a certain network.

The above-mentioned target detection module 1004 is also configured to, for each second reference position, respectively calculate the overlap ratio of the second prediction frame corresponding to the second reference position and each target true value frame of the image sample to obtain the matrix IoU:

Among them, i takes the value [1, N], j takes the value [1, A], N is the number of marked ground truth boxes, and A is the number of second reference positions included in the second feature map; based on the second reference The overlap ratio of the second prediction frame corresponding to the position and each target ground-truth frame and the score of the second reference position determine the prediction quality of the second reference position for the target corresponding to each target ground-truth frame; wherein, the prediction quality is used for Characterizing that what the second reference position detects is the probability of the target corresponding to the target ground-truth frame; the label assignment of each first reference position is determined based on the prediction quality of each second reference position for the target corresponding to each target ground-truth frame information.

The above-mentioned target detection module 1004 is further configured to calculate the prediction quality of each second reference position for the target corresponding to each target ground truth frame by using the formula q _ij =(s _j ) ^1-α *(IoU _ij ) ^α , to obtain the prediction quality Matrix Q; among them, q _ij is [0,1], α is a preset hyperparameter with a value in the [0,1] interval, s _j is the score of the jth second reference position, and IoU _ij is the first The overlap ratio of the second prediction frame corresponding to the j second reference positions and the i-th target ground-truth frame is the element of the i-th row and the j-th column in the matrix IoU;

The above image samples are also marked with the target type corresponding to each target ground-truth box;

Use the formula q _ij =(s _ij ) ^1-α *(IoU _ij ) ^α to calculate the prediction quality of each second reference position for the target corresponding to each target ground-truth frame, and obtain the prediction quality matrix Q; where q _ij takes the value is [0,1], α is a preset hyperparameter with a value in the [0,1] interval, s _ij is the score corresponding to the current target type in the score of the j-th second reference position, and the current target type refers to The target type corresponding to the i-th target ground-truth frame, IoU _ij is the overlap ratio between the second prediction frame corresponding to the j-th second reference position and the i-th target ground-truth frame, which is the i-th row j-th in the matrix IoU the elements of the column;

The above-mentioned target detection module 1004 is also configured to, for each second reference position, select the maximum prediction quality from the prediction quality of the target corresponding to each target truth frame of the second reference position; determine whether the maximum prediction quality is greater than or equal to The first preset quality value; if yes, assign the positive sample label of the target corresponding to the maximum predicted quality to the second reference position.

The above-mentioned target detection module 1004 is further configured to, for the jth column in the prediction quality matrix, select the element q _mj with the largest value from the elements of the column;

The above-mentioned target detection module 1004 is also configured to, for the i-th row in the prediction quality matrix, select a target element q _im greater than t _p from the elements of the row, and set the label assignment initial matrix in the i-th row corresponding to the target element. The element X _im of is the first value; wherein, q _im is larger than other elements q _iu that are not selected in this row of elements;

The above-mentioned target detection module 1004 is further configured to perform the following steps for each first reference position corresponding to each pixel in the first feature map: determine a second reference position corresponding to the first reference position; based on the first reference position The label assignment information of the first reference position is determined, and the target truth value frame of the first reference position is determined; the classification loss function value and the regression loss function value are calculated based on the target truth value frame of the first reference position and the score of the first reference position; based on each first reference position The classification loss function value and the regression loss function value determine the loss function value of the student network model.

The target detection device provided by the embodiment of the present disclosure has the same technical features as the above-mentioned target detection method, so it can also solve the same technical problem and achieve the same technical effect.

This embodiment further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and the computer program executes the steps of the above-mentioned target detection method when the computer program is run by the processing device.

The target detection method, apparatus, and computer program product of an electronic device provided by the embodiments of the present disclosure include a computer-readable storage medium storing program codes, and the instructions included in the program codes can be used to execute the methods described in the foregoing method embodiments. For the specific implementation, reference may be made to the method embodiments, which will not be repeated here.

Those skilled in the art can clearly understand that, for the convenience and brevity of description, for the specific working process of the electronic equipment and apparatus described above, reference may be made to the corresponding process in the foregoing method embodiments, which will not be repeated here.

In addition, in the description of the embodiments of the present disclosure, unless otherwise expressly specified and limited, the terms "installed", "connected" and "connected" should be understood in a broad sense, for example, it may be a fixed connection or a detachable connection , or integrally connected; it can be a mechanical connection or an electrical connection; it can be a direct connection, or an indirect connection through an intermediate medium, or the internal communication between the two components. For those skilled in the art, the specific meanings of the above terms in the present disclosure can be understood in specific situations.

The functions, if implemented in the form of software functional units and sold or used as independent products, may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the present disclosure can be embodied in the form of software products in essence, or the parts that make contributions to the prior art or the parts of the technical solutions. The computer software products are stored in a storage medium, including Several instructions are used to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the embodiments of the present disclosure. The aforementioned storage medium includes: U disk, mobile hard disk, Read-Only Memory (ROM, Read-Only Memory), Random Access Memory (RAM, Random Access Memory), magnetic disk or optical disk and other media that can store program codes .

In the description of the present disclosure, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. The indicated orientation or positional relationship is based on the orientation or positional relationship shown in the drawings, and is only for the convenience of describing the present disclosure and simplifying the description, rather than indicating or implying that the indicated device or element must have a specific orientation or a specific orientation. construction and operation, and therefore should not be construed as limiting the present disclosure. Furthermore, the terms "first", "second", and "third" are used for descriptive purposes only and should not be construed to indicate or imply relative importance.

Finally, it should be noted that the above embodiments are only specific implementations of the present disclosure, and are used to illustrate the technical solutions of the present disclosure, rather than limit them. The protection scope of the present disclosure is not limited thereto, although referring to the foregoing embodiments The present disclosure has been described in detail, and those skilled in the art should understand that any person skilled in the art who is familiar with the technical field can still modify the technical solutions described in the foregoing embodiments or easily think of them within the technical scope disclosed in the present disclosure. change, or equivalently replace some of the technical features; and these modifications, changes or replacements do not make the essence of the corresponding technical solutions deviate from the spirit and scope of the technical solutions of the embodiments of the present disclosure, and should be included within the protection scope of the present disclosure. Inside. Therefore, the protection scope of the present disclosure should be subject to the protection scope of the claims.

Industrial Applicability

In the technical solution proposed in the present disclosure, the label assignment method is more efficient, and the influence of the subjectivity of manually designing label assignment rules on the training effect of the student network model is effectively alleviated. The label assignment method can be adapted to anchor frame-based and non-anchor-based The network of boxes is more general than the label assignment method designed for a certain network.

Claims

A target detection method, characterized in that the method comprises:

Get the image to be detected;

Input the image to be detected into a target detection model to obtain a target detection result; the target detection result includes the position and score of the bounding box corresponding to the target; wherein, the target detection model is trained in the following manner:

Input the image samples in the image sample set into the student network model, and obtain the student model detection result corresponding to each pixel of the first feature map of the image sample; wherein, the image sample is marked with a target ground-truth frame, and the The student model detection result includes the score of the first reference position corresponding to each pixel of the first feature map and the coordinate information corresponding to the first reference position;

Obtain the teacher model detection result of the image sample by the teacher network model; wherein, the teacher network model is a pre-trained model, and the teacher model detection result includes each pixel corresponding to the second feature map of the image sample. The score of the second reference position corresponding to the point and the coordinate information corresponding to the second reference position; wherein, the number of reference positions of the first feature map and the second feature map is the same;

Determine the label assignment information of the image sample according to the detection result of the teacher model;

Calculate the loss function value of the student network model according to the label assignment information and the student model detection result;

Adjust the parameters of the student network model based on the loss function value and continue training until the trained student network model is obtained;

The trained student network model is used as the target detection model.
The method according to claim 1, wherein the step of determining label assignment information according to the detection result of the teacher model comprises:

For each of the second reference positions, calculate the overlap ratio of the second prediction frame corresponding to the second reference position and each target ground-truth frame of the image sample to obtain the matrix IoU:

Wherein, i takes the value [1, N], j takes the value [1, A], N is the number of the labeled ground truth boxes, and A is the number of the second reference positions included in the second feature map;

Based on the overlap ratio of the second reference position and each of the target ground-truth boxes and the score of the second reference position, determining that the second reference position corresponds to a target for each of the target ground-truth boxes The prediction quality; wherein, the prediction quality is used to characterize the probability that what the second reference position detects is the target corresponding to the target ground truth frame;

Label assignment information for each of the first reference positions is determined based on the prediction quality of each of the second reference positions for the target corresponding to each of the target ground truth boxes.
The method according to claim 2, wherein, based on the overlap ratio of the second prediction frame corresponding to the second reference position and each of the target ground-truth boxes and the score of the second reference position , the step of determining the prediction quality of each of the second reference positions for the target corresponding to each of the target ground-truth boxes, including:

Use the formula q ij =(s j ) 1-α *(IoU ij ) α to calculate the prediction quality of each of the second reference positions for the target corresponding to each of the target ground-truth boxes, to obtain a prediction quality matrix Q; wherein, q ij is [0,1], α is a preset hyperparameter with a value in the [0,1] interval, s j is the score of the jth second reference position, and IoU ij is the jth The overlap ratio of the second prediction frame corresponding to the two reference positions and the i-th target truth frame is the element of the i-th row and the j-th column in the matrix IoU;
The method according to claim 2, wherein the image sample is further marked with the target type corresponding to each target ground-truth frame;

Based on the overlap ratio of the second prediction frame corresponding to the second reference position and each of the target ground-truth boxes and the score of the second reference position, it is determined that each of the second reference positions is for each The step of predicting the quality of the target corresponding to the target ground-truth frame includes:

Use the formula q ij =(s ij ) 1-α *(IoU ij ) α to calculate the prediction quality of each of the second reference positions for the target corresponding to each of the target ground-truth boxes, and obtain a prediction quality matrix Q; wherein, q ij is [0, 1], α is a preset hyperparameter with a value in the [0, 1] interval, s ij is the score corresponding to the current target type in the score of the j-th second reference position, so The current target type refers to the target type corresponding to the i-th target ground-truth frame, and IoU ij is the overlap ratio between the second prediction frame corresponding to the j-th second reference position and the i-th target ground-truth frame, and is the matrix IoU The element in the i-th row and the j-th column;
The method according to any one of claims 2-4, wherein each of the first reference positions is determined based on the prediction quality of each of the second reference positions for the target corresponding to each of the target ground-truth boxes The steps for assigning information on labels include:

For each of the second reference positions, the maximum prediction quality is selected from the prediction quality of the target corresponding to each of the target ground-truth boxes at the second reference position;

judging whether the maximum predicted quality is greater than or equal to a first preset quality value;

If yes, assign the positive sample label of the target corresponding to the maximum prediction quality to the second reference position.
The method according to claim 3 or 4, wherein the step of determining the label assignment information of the first reference position based on the prediction quality of each second reference position for each target corresponding to the target ground truth frame, include:

For the jth column in the prediction quality matrix, select the element q mj with the largest value from the elements of the column;

If q mj is greater than t p , set the element X mj in the label assignment matrix corresponding to q mj equal to the first value; for the elements q ij in the jth column except q mj , if q ij is less than t n , set q ij corresponding to The element X ij in the label assignment matrix of is equal to the second value; if q ij is less than or equal to t p and greater than or equal to t n , set the corresponding X ij of q ij equal to the third value;

If q mj is less than t n , set the element X ij in the j-th column of the label assignment matrix equal to the second value;

If q mj is less than or equal to t p and greater than or equal to t n , set the element X mj in the label assignment matrix corresponding to q mj equal to the third value; for the elements q ij in the jth column other than q mj , if q ij is less than t n , set the element X ij in the label assignment matrix corresponding to q ij equal to the second value; if q ij is less than or equal to t p and greater than or equal to t n , set q ij corresponding to the element X ij in the label assignment matrix element X ij is equal to said third value;

t p >t n , t p and t n are preset thresholds respectively, the first value represents positive samples, the second value represents negative samples, and the third value represents ignore samples.
The method according to claim 3 or 4, wherein the label assignment of each of the first reference positions is determined based on the prediction quality of each of the second reference positions for the target corresponding to each of the target ground-truth boxes Information steps, including:

For the i-th row in the prediction quality matrix, a target element q im greater than t p is selected from the elements in the row, and the element X im corresponding to the target element in the i-th row of the initial label assignment matrix is set as The first value; wherein, q im is greater than the other elements q iu that are not selected in this row of elements;

For the element q iu other than q im in the i-th row, if the q iu is less than or equal to t p and greater than or equal to t n , set the element X iu of the q iu corresponding to the label assignment initial matrix to be equal to the third value ; If the qi iu is less than t n , set the element X iu of the corresponding label assignment initial matrix of the qi iu to be equal to the second value;

Check whether the elements in the jth column of the initial label allocation matrix have conflicting elements; wherein, the conflicting elements are more than 2, and the elements are all the first value;

If there is a conflicting element, obtain the prediction quality corresponding to the conflicting element in the prediction quality matrix, keep the element with the highest prediction quality in the conflicting element as the first value, and modify the rest of the elements to the third value to obtain label assignment matrix;

t p >t n , t p and t n are preset thresholds respectively, the first value represents positive samples, the second value represents negative samples, and the third value represents ignore samples.
The method according to any one of claims 1-7, wherein the step of calculating the loss function value of the student network model according to the label assignment information and the student model detection result includes:

For each first reference position corresponding to each pixel in the first feature map, the following steps are performed:

determining the target ground truth frame of the first reference position based on the label assignment information of the first reference position;

Calculate the classification loss function value and the regression loss function value based on the target ground-truth box of the first reference position and the score of the first reference position;

The loss function value of the student network model is determined based on the classification loss function value and the regression loss function value of each of the first reference positions.
A target detection device, characterized in that the device comprises:

an image acquisition module, configured to acquire an image to be detected;

A target detection module, configured to input the image to be detected into a target detection model to obtain a target detection result; the target detection result includes the position and score of the bounding box corresponding to the target; wherein, the target detection model is trained in the following manner:

Input the image samples in the image sample set into the student network model, and obtain the student model detection result corresponding to each pixel of the first feature map of the image sample; wherein, the image sample is marked with a target ground-truth frame, and the The student model detection result includes the score of the first reference position corresponding to each pixel of the first feature map and the coordinate information corresponding to the first reference position;

Obtain the teacher model detection result of the image sample by the teacher network model; wherein, the teacher network model is a pre-trained model, and the teacher model detection result includes each pixel corresponding to the second feature map of the image sample. The score of the second reference position corresponding to the point and the coordinate information corresponding to the second reference position; wherein, the reference position numbers and/or position points of the first feature map and the second feature map are the same;

Determine the label assignment information of the image sample according to the detection result of the teacher model;

Calculate the loss function value of the student network model according to the label assignment information and the student model detection result;

Adjust the parameters of the student network model based on the loss function value and continue training until the trained student network model is obtained;

The trained student network model is used as the target detection model.
The device according to claim 9, wherein the target detection module is further configured to:

For each of the second reference positions, calculate the overlap ratio of the second prediction frame corresponding to the second reference position and each target ground-truth frame of the image sample to obtain the matrix IoU:

Wherein, i takes the value [1, N], j takes the value [1, A], N is the number of the labeled ground truth boxes, and A is the number of the second reference positions included in the second feature map;

Based on the overlap ratio of the second reference position and each of the target ground-truth boxes and the score of the second reference position, determining that the second reference position corresponds to a target for each of the target ground-truth boxes The prediction quality; wherein, the prediction quality is used to characterize the probability that what the second reference position detects is the target corresponding to the target ground truth frame;

The label assignment information of each of the first reference positions is determined based on the prediction quality of each of the second reference positions for the target corresponding to each of the target ground-truth boxes,

Use the formula q ij =(s j ) 1-α *(IoU ij ) α to calculate the prediction quality of each of the second reference positions for the target corresponding to each of the target ground-truth boxes, to obtain a prediction quality matrix Q; wherein, q ij is [0,1], α is a preset hyperparameter with a value in the [0,1] interval, s j is the score of the jth second reference position, and IoU ij is the jth The overlap ratio of the second prediction frame corresponding to the two reference positions and the i-th target truth frame is the element of the i-th row and the j-th column in the matrix IoU;
The apparatus according to claim 10, wherein the target detection module is further configured to:

Use the formula q ij =(s j ) 1-α *(IoU ij ) α to calculate the prediction quality of each of the second reference positions for the target corresponding to each of the target ground-truth boxes, to obtain a prediction quality matrix Q; wherein, q ij is [0,1], α is a preset hyperparameter with a value in the [0,1] interval, s j is the score of the jth second reference position, and IoU ij is the jth The overlap ratio of the second prediction frame corresponding to the two reference positions and the i-th target truth frame is the element of the i-th row and the j-th column in the matrix IoU;
The device according to claim 10, wherein the image sample is further marked with the target type corresponding to each target ground-truth frame; the target detection module is further configured to:

Use the formula q ij =(s ij ) 1-α *(IoU ij ) α to calculate the prediction quality of each of the second reference positions for the target corresponding to each of the target ground-truth boxes, and obtain a prediction quality matrix Q; wherein, q ij is [0, 1], α is a preset hyperparameter with a value in the [0, 1] interval, s ij is the score corresponding to the current target type in the score of the j-th second reference position, so The current target type refers to the target type corresponding to the i-th target ground-truth frame, and IoU ij is the overlap ratio between the second prediction frame corresponding to the j-th second reference position and the i-th target ground-truth frame, and is the matrix IoU The element in the i-th row and the j-th column;
The device according to any one of claims 10-12, wherein the target detection module is further configured to:

For each of the second reference positions, the maximum prediction quality is selected from the prediction quality of the target corresponding to each of the target ground-truth boxes at the second reference position;

judging whether the maximum predicted quality is greater than or equal to a first preset quality value;

If yes, assign the positive sample label of the target corresponding to the maximum prediction quality to the second reference position.
The device according to claim 11 or 12, wherein the target detection module is further configured to:

For the jth column in the prediction quality matrix, select the element q mj with the largest value from the elements of the column;

If q mj is greater than t p , set the element X mj in the label assignment matrix corresponding to q mj equal to the first value; for the elements q ij in the jth column except q mj , if q ij is less than t n , set q ij corresponding to The element X ij in the label assignment matrix of is equal to the second value; if q ij is less than or equal to t p and greater than or equal to t n , set the corresponding X ij of q ij equal to the third value;

If q mj is less than t n , set the element X ij in the j-th column of the label assignment matrix equal to the second value;

If q mj is less than or equal to t p and greater than or equal to t n , set the element X mj in the label assignment matrix corresponding to q mj equal to the third value; for the elements q ij in the jth column other than q mj , if q ij is less than t n , set the element X ij in the label assignment matrix corresponding to q ij equal to the second value; if q ij is less than or equal to t p and greater than or equal to t n , set q ij corresponding to the element X ij in the label assignment matrix element X ij is equal to said third value;

t p >t n , t p and t n are preset thresholds respectively, the first value represents positive samples, the second value represents negative samples, and the third value represents ignore samples.
The device according to claim 11 or 12, wherein the target detection module is further configured to:

For the i-th row in the prediction quality matrix, a target element q im greater than t p is selected from the elements in the row, and the element X im corresponding to the target element in the i-th row of the initial label assignment matrix is set as The first value; wherein, q im is greater than the other elements q iu that are not selected in this row of elements;

For the element q iu other than q im in the i-th row, if the q iu is less than or equal to t p and greater than or equal to t n , set the element X iu of the q iu corresponding to the label assignment initial matrix to be equal to the third value ; If the qi iu is less than t n , set the element X iu of the corresponding label assignment initial matrix of the qi iu to be equal to the second value;

Check whether the elements in the jth column of the initial label allocation matrix have conflicting elements; wherein, the conflicting elements are more than 2, and the elements are all the first value;

If there is a conflicting element, obtain the prediction quality corresponding to the conflicting element in the prediction quality matrix, keep the element with the highest prediction quality in the conflicting element as the first value, and modify the rest of the elements to the third value to obtain label assignment matrix;

t p >t n , t p and t n are preset thresholds respectively, the first value represents positive samples, the second value represents negative samples, and the third value represents ignore samples.
The device according to any one of claims 9-15, wherein the target detection module is further configured to:

For each first reference position corresponding to each pixel in the first feature map, the following steps are performed:

determining the target ground truth frame of the first reference position based on the label assignment information of the first reference position;

Calculate the classification loss function value and the regression loss function value based on the target ground-truth box of the first reference position and the score of the first reference position;

The loss function value of the student network model is determined based on the classification loss function value and the regression loss function value of each of the first reference positions.
An electronic device, comprising a memory, a processor, and a computer program stored on the memory and running on the processor, characterized in that, when the processor executes the computer program, the above claims 1- 8. The steps of any one of the methods.
A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and when the computer program is run by a processor, the steps of the method according to any one of the preceding claims 1-8 are executed. .