CN111241947B - Training method and device for target detection model, storage medium and computer equipment


Info

Publication number
CN111241947B
Authority
CN
China
Prior art keywords
sample image
target
detection frame
prediction
position information
Legal status
Active
Application number
CN201911422532.3A
Other languages
Chinese (zh)
Other versions
CN111241947A (en)
Inventor
岑俊毅
李立赛
傅东生
Current Assignee
Shenzhen Miracle Intelligent Network Co Ltd
Original Assignee
Shenzhen Miracle Intelligent Network Co Ltd
Application filed by Shenzhen Miracle Intelligent Network Co Ltd
Priority to CN201911422532.3A
Publication of CN111241947A
Application granted
Publication of CN111241947B
Legal status: Active
Anticipated expiration

Classifications

    • G06V20/40 Scenes; scene-specific elements in video content
    • G06F18/23 Pattern recognition; analysing; clustering techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06V10/267 Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • Y02T10/40 Engine management systems (climate change mitigation technologies related to transportation)

Abstract

The application relates to a training method and apparatus for a target detection model, a computer-readable storage medium, and computer equipment. The training method includes: acquiring a feature map of a sample image during training, and determining initial detection frames in the feature map according to preset rotation angles, preset scales, and preset target aspect ratios; adjusting the position of each initial detection frame to obtain position information of prediction detection frames, and adjusting network parameters of a regression network according to this position information and the real position information in the annotation information of the sample image; predicting, from the target detection areas determined by the position information of the prediction detection frames, the prediction probability that the target corresponds to each preset category; and adjusting network parameters of a classification network according to the real category information in the annotation information of the sample image and the prediction probabilities, to obtain a target detection model for performing target detection on images. With the scheme provided by the application, the target detection model can identify the rotation angle of a target in an image, so the located target detection frame is more accurate.

Description

Training method and device for target detection model, storage medium and computer equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a training method and apparatus for a target detection model, a computer readable storage medium, and a computer device.
Background
Object detection, also called object extraction, is an image segmentation technique in the field of computer vision. It not only segments objects from an image, that is, locates their positions, but also identifies the category of each object.
When the position regression network of a target detection model is trained, a sliding-window method is generally adopted to traverse regions of the image, and the regions are then screened to serve as candidate rectangular areas for target detection. However, the inventors realized that these candidate rectangular areas are generally horizontal rectangles: they can accurately and effectively locate targets that are placed horizontally or upright in the image, but when a rotated target or a target with an irregular outline appears in the image, the target detection frame determined from such candidate areas is not accurate enough. For example, when an elongated object (such as a pencil) is displayed at an angle to the horizontal, a horizontal rectangular frame used for marking it encloses a background area far larger than the area of the target itself, so the target is located inaccurately and the target recognition rate is low.
Disclosure of Invention
Based on the above, there is a need to provide a training method and apparatus for an object detection model, a computer-readable storage medium, and a computer device that solve the technical problems of inaccurate positioning and a low recognition rate when existing object detection models are used to locate rotated or irregularly shaped objects in an image.
A training method of a target detection model, comprising:
acquiring a sample image and annotation information, wherein the annotation information comprises real position information and real category information of a target in the sample image, and the real position information comprises a rotation angle of a rectangular bounding box corresponding to the target;
obtaining a feature map of the sample image through a feature extraction network of the initial model;
determining an initial detection frame in the feature map according to a preset rotation angle, a preset scale and a preset target aspect ratio through a region generation network of the initial model;
adjusting the position of each initial detection frame through a regression network of the initial model to obtain position information of prediction detection frames, and adjusting network parameters of the regression network according to the real position information in the annotation information and the position information of the prediction detection frames;
predicting, through a classification network of the initial model, the prediction probability of each preset category corresponding to the target according to the target detection area determined by the position information of each prediction detection frame; and
adjusting network parameters of the classification network according to the real category information in the annotation information and the prediction probability, to obtain a target detection model for performing target detection on images.
A training apparatus for a target detection model, the apparatus being configured to perform operations comprising:
acquiring a sample image and annotation information, wherein the annotation information comprises real position information and real category information of a target in the sample image, and the real position information comprises a rotation angle of a rectangular bounding box corresponding to the target;
obtaining a feature map of the sample image through a feature extraction network of the initial model;
determining an initial detection frame in the feature map according to a preset rotation angle, a preset scale and a preset target aspect ratio through a region generation network of the initial model;
adjusting the position of each initial detection frame through a regression network of the initial model to obtain position information of prediction detection frames, and adjusting network parameters of the regression network according to the real position information in the annotation information and the position information of the prediction detection frames;
predicting, through a classification network of the initial model, the prediction probability of each preset category corresponding to the target according to the target detection area determined by the position information of each prediction detection frame; and
adjusting network parameters of the classification network according to the real category information in the annotation information and the prediction probability, to obtain a target detection model for performing target detection on images.
In one embodiment, acquiring the sample image includes: acquiring an original sample image; judging whether the aspect ratio of the original sample image is 1; if so, scaling the original sample image proportionally to a preset size to obtain the sample image; if not, scaling the original sample image proportionally and then padding it with image pixels to obtain a sample image of the preset size.
In one embodiment, the acquiring the sample image includes: acquiring an original sample image; performing rotation processing on the original sample image according to a preset angle to obtain a sample image, and obtaining real labeling information of the sample image according to the rotation angle of a rectangular bounding box in the original sample image and the preset angle; or performing vertical mirror image processing on the original sample image to obtain a sample image, and obtaining real annotation information of the sample image according to the rotation angle of a rectangular bounding box in the original sample image; or performing horizontal mirroring on the original sample image to obtain a sample image, and obtaining real annotation information of the sample image according to the rotation angle of the rectangular bounding box in the original sample image.
In one embodiment, the step of determining the target aspect ratio comprises: acquiring a sample image and width and height information of a rectangular bounding box corresponding to a target in each sample image; counting the aspect ratio of each rectangular bounding box according to the width-height information; and clustering the counted aspect ratios to obtain target aspect ratios in the clustering result.
In one embodiment, the adjusting the position of each initial detection frame to obtain the position information of the predicted detection frame includes: calculating the position offset of each initial detection frame according to the current network parameters of the regression network; obtaining the position information of a prediction detection frame according to the initial detection frame and the position offset; the position information comprises coordinates of a geometric center point of the prediction detection frame, a width and a height of the prediction detection frame and a rotation angle of the prediction detection frame.
In one embodiment, the method further comprises: determining the prediction detection frame according to the position information of the prediction detection frame; determining the rectangular bounding box corresponding to the target in the sample image according to the real position information; calculating the intersection ratio (intersection-over-union, IoU) between the prediction detection frame and the rectangular bounding box; calculating the rotation angle difference between the prediction detection frame and the rectangular bounding box; when the intersection ratio is greater than a first threshold and the rotation angle difference is smaller than a second threshold, marking the sample image as a positive sample image; and when the intersection ratio is smaller than a third threshold or the rotation angle difference is larger than the second threshold, marking the sample image as a negative sample image.
In one embodiment, predicting the prediction probability of the target corresponding to each preset category according to the target detection area determined by the position information of each prediction detection frame includes: determining the target detection area on the feature map according to the position information of each prediction detection frame; adjusting each target detection area to the same preset scale and then obtaining the feature vector corresponding to each target detection area; and determining, from the feature vectors, the prediction probability of the target detection area corresponding to each preset category.
A computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the training method of the object detection model described above.
A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the training method of the object detection model described above.
According to the training method and apparatus, computer-readable storage medium, and computer equipment for the target detection model, on the one hand, the annotation information of the sample image used in training comprises real position information and real category information, and the real position information includes a rotation angle; the trained target detection model therefore has the ability to identify the rotation angle of a target in an image, so the located target detection frame is more accurate. On the other hand, during training, the rotation angles, scales, and target aspect ratios used by the region generation network to generate initial detection frames are initialized, which enriches the ways in which initial detection frames are generated and makes the trained model more stable; moreover, because the initial detection frames are determined according to preset rotation angles, they lie closer to the real target detection frame. In this way, the positions of the initial detection frames are adjusted through the regression network to obtain prediction detection frames, the target detection areas on the feature map are obtained from the prediction detection frames, the network parameters of the regression network are adjusted according to the real position information in the annotation information and the position information of the prediction detection frames, and, after the class probabilities of the target detection areas are predicted by the classification network, the network parameters of the classification network are adjusted according to the real category information in the annotation information and the prediction probabilities. A target detection model is thereby obtained that can detect rotated targets in images and locate targets more accurately.
Drawings
FIG. 1 is an application environment diagram of a training method of a target detection model in one embodiment;
FIG. 2 is a flow chart of a training method of a target detection model in one embodiment;
FIG. 3 is a schematic illustration of labeling a sample image in one embodiment;
FIG. 4 is a flow diagram of labeling a sample image in one embodiment;
FIG. 5 is a schematic diagram of an enhancement process performed on an original sample image to obtain a sample image in one embodiment;
FIG. 6 is a schematic diagram of an initial detection box determined from a feature map in one embodiment;
FIG. 7 is a flow chart of a training method of the object detection model in one embodiment;
FIG. 8 is a block diagram of a training apparatus for an object detection model in one embodiment;
FIG. 9 is a block diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
FIG. 1 is a diagram of an application environment for a training method of an object detection model in one embodiment. Referring to fig. 1, the training method of the target detection model is applied to a training system of the target detection model. The training system of the object detection model may include a terminal 110 and a server 120. The terminal 110 and the server 120 may be connected through a network, and the terminal 110 may be a desktop terminal or a mobile terminal, and the mobile terminal may be at least one of a mobile phone, a tablet computer, a notebook computer, and the like. The server 120 may be implemented as a stand-alone server or as a server cluster composed of a plurality of servers.
Specifically, the terminal 110 may acquire a sample image and transfer the sample image to the server 120. After obtaining the sample image, the server 120 trains the initial model with the sample image to obtain a target detection model for target detection of the image.
In one embodiment, the server 120 may obtain a sample image and labeling information, where the labeling information includes real position information and real category information of a target in the sample image, and the real position information includes a rotation angle of a rectangular bounding box corresponding to the target; obtaining a feature map of the sample image through a feature extraction network of the initial model; determining an initial detection frame in the feature map according to a preset rotation angle, a preset scale and a preset target aspect ratio through a region generation network of the initial model; the position of each initial detection frame is adjusted through a regression network of the initial model, the position information of the prediction detection frame is obtained, and the network parameters of the regression network are adjusted according to the real position information in the labeling information and the position information of the prediction detection frame; predicting the prediction probability of the target corresponding to each preset category according to the target detection area determined by the position information of each prediction detection frame through the classification network of the initial model; and adjusting network parameters of the classification network according to the real category information and the prediction probability in the labeling information to obtain a target detection model for carrying out target detection on the image.
In one embodiment, as shown in FIG. 2, a training method for a target detection model is provided. The method is described as being applied to a computer device (such as the terminal 110 or the server 120 in fig. 1 described above) as an example. The method may include the following steps S202 to S212.
S202, acquiring a sample image and labeling information, wherein the labeling information comprises real position information and real category information of a target in the sample image, and the real position information comprises a rotation angle of a rectangular bounding box corresponding to the target.
The sample image is an image used to train the initial model; a model trained on sample images acquires the ability to perform target detection on images. Target detection must not only segment the target from the image, that is, locate its position, but also identify its category. The category information of a target in the sample image may be one or more of a plurality of preset classification categories, which may be set according to the requirements of the actual application, for example a face, a vehicle, an animal, and so on. The position information of the target in the sample image may be represented by the position information of a rectangular bounding box surrounding the target, such as an x-coordinate, a y-coordinate, and the width w and height h of the rectangular bounding box, which do not change when the box rotates around its geometric center. In addition, in the embodiments provided in the present application, the position information further includes the rotation angle θ of the rectangular bounding box corresponding to the target; that is, the annotation information of the sample image may be represented by a set of data comprising x, y, w, h, and θ. The rotation angle θ may be the offset angle of the rectangular bounding box relative to its horizontal placement, for example the angle between a long side of the box and the positive x-axis of the sample image, the angle between a long side and the positive y-axis, the angle between a short side and the positive x-axis, or the angle between a short side and the positive y-axis. The rotation angle may take any value between 0 and 360 degrees. It can be appreciated that, because the annotation information of the sample image includes the rotation angle, a model trained on such annotated sample images also has the ability to identify the rotation angle of a target in an image, so the target can be located more accurately according to the rotation angle.
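For illustration only, such a five-value annotation can be held in a simple structure. The minimal Python sketch below is an assumption of this description; the RotatedBox name and field layout are not a data format defined by the application.

```python
from dataclasses import dataclass

@dataclass
class RotatedBox:
    """A rotated rectangular bounding box annotation: (x, y) is the
    geometric center, (w, h) the width and height (unchanged by
    rotation about the center), and theta the rotation angle in
    degrees, taking any value in [0, 360)."""
    x: float
    y: float
    w: float
    h: float
    theta: float

# e.g. an elongated target tilted 45 degrees from the horizontal:
label = RotatedBox(x=320.0, y=240.0, w=200.0, h=24.0, theta=45.0)
```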
FIG. 3 is a schematic illustration of labeling a sample image in one embodiment. Referring to fig. 3, the object in the sample image is a pencil, wherein the left side of fig. 3 is a schematic diagram for labeling the object in the sample image in the conventional technology, the rectangular bounding box in the drawing is a horizontal rectangular box, and the horizontal rectangular box includes a large amount of background information, and the background information is even larger than the object information, which can cause a decrease in recognition rate and an inaccurate positioning of the object. The right side of fig. 3 is a schematic diagram for labeling a target in a sample image according to an embodiment of the present application, where a rectangular bounding box in the drawing is a rectangular box with a rotation angle, and the rectangular bounding box more accurately indicates a position of the target in the image.
The initial model may be a machine learning model that may learn through the sample image, thereby providing the ability to identify the image. In embodiments provided herein, a computer device may learn the ability to target an image from a sample image. In one embodiment, the computer device may set a model structure of the machine learning model in advance to obtain an initial model, and train the initial model through the sample image to obtain model parameters of the machine learning model. When the image is required to be subjected to target detection, the computer equipment can acquire model parameters obtained through training in advance, and then the model parameters are imported into an initial model to obtain a target detection model with the capability of carrying out target detection on the image.
In one embodiment, the sample image may be brought to a uniform size before labeling, and acquiring the sample image includes: acquiring an original sample image; judging whether the aspect ratio of the original sample image is 1; if so, scaling the original sample image proportionally to a preset size to obtain the sample image; if not, scaling the original sample image proportionally and then padding it with image pixels to obtain a sample image of the preset size.
Since the input images of the feature extraction network must all have the same size, so that the input and output of the whole network are of fixed size, the sample images need to be preprocessed. Specifically, it is first determined whether the width and height of the original sample image are the same; if so, the width and height of the original sample image are scaled to a preset size, for example S×S. If the width and height differ and the width is greater than the height, the width of the original sample image is first scaled to the preset size S, the height is then scaled to S′ according to the aspect ratio of the original sample image, and pixels are added to the upper or lower region of the image to bring the height to S, yielding a sample image of size S×S. When the height is greater than the width, the height of the original sample image is scaled to the preset size S, the width is scaled to S′ according to the aspect ratio of the original sample image, and pixels are added to the left or right region to bring the width to S, again yielding a sample image of size S×S. Proportional scaling ensures that the target in the sample image is not deformed, and pixels are added to the picture whose length or width falls short in order to satisfy the requirement that all sample images have a consistent size.
FIG. 4 is a flow chart illustrating labeling of a sample image in one embodiment. Referring to fig. 4, the method comprises the steps of:
S402, acquire an original sample image;
S404, judge whether the aspect ratio of the original sample image is 1; if yes, go to step S406; if not, go to step S408;
S406, scale the width and height of the original sample image proportionally to the preset size S;
S408, judge whether the width of the original sample image is greater than its height; if yes, go to step S410a; if not, go to step S410b;
S410a, scale the width of the original sample image to the preset size S, then scale the height to S′ according to the aspect ratio of the original sample image;
S412a, add pixels to the upper or lower region of the image so that its height becomes S, obtaining a sample image of size S×S;
S410b, scale the height of the original sample image to the preset size S, then scale the width to S′ according to the aspect ratio of the original sample image;
S412b, add pixels to the left or right region of the image so that its width becomes S, obtaining a sample image of size S×S;
S414, label the adjusted sample image.
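A minimal Python sketch of the above flow, assuming the Pillow library; the preset size S = 800, the black fill, and the splitting of the padding over both sides are illustrative assumptions rather than choices fixed by the application.

```python
from PIL import Image

def letterbox(original: Image.Image, S: int = 800) -> Image.Image:
    """Proportionally scale an image and pad it to S x S, following
    steps S402-S412 above."""
    w, h = original.size
    if w == h:                                   # aspect ratio is 1 (S406)
        return original.resize((S, S))
    if w > h:                                    # S410a / S412a
        s_prime = round(h * S / w)               # scaled height S'
        resized = original.resize((S, s_prime))
        canvas = Image.new(original.mode, (S, S))
        canvas.paste(resized, (0, (S - s_prime) // 2))
    else:                                        # S410b / S412b
        s_prime = round(w * S / h)               # scaled width S'
        resized = original.resize((s_prime, S))
        canvas = Image.new(original.mode, (S, S))
        canvas.paste(resized, ((S - s_prime) // 2, 0))
    return canvas
```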
In one embodiment, the method further includes the step of obtaining the target aspect ratio by counting the aspect ratio of the rectangular bounding box marked in the sample image: acquiring width and height information of a rectangular bounding box corresponding to a target in each sample image; counting the aspect ratio of each rectangular bounding box according to the width-height information; and clustering the counted aspect ratios to obtain target aspect ratios in the clustering result.
Specifically, after labeling the sample images, the computer device may obtain the width and height information of the rectangular bounding boxes labeled in the sample images, calculate their aspect ratios, and cluster the calculated aspect ratios with a clustering algorithm. The number of clusters may be set as needed; for example, K-means clustering with 3 clusters yields 3 representative aspect-ratio values, giving target aspect ratios w1:h1, w2:h2, and w3:h3, for example 1:2, 1:3, and 1:4.
It should be noted that the target aspect ratios obtained by the computer device are used in step S206 to determine the initial detection frames from the feature map of the sample image; the more kinds of target aspect ratio there are, the more initial detection frames are determined.
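The aspect-ratio statistics can be sketched as follows with scikit-learn's KMeans, using k = 3 clusters as in the example above; the helper name and the sample data are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def target_aspect_ratios(boxes_wh: np.ndarray, k: int = 3) -> list:
    """boxes_wh: array of shape (N, 2) holding the (w, h) of every
    rectangular bounding box labeled across the sample images.
    Returns k representative w:h values (the K-means cluster centers)."""
    ratios = (boxes_wh[:, 0] / boxes_wh[:, 1]).reshape(-1, 1)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(ratios)
    return sorted(float(c) for c in km.cluster_centers_.ravel())

# Illustrative data: mostly elongated boxes, so the centers come out
# near the 1:2, 1:3, 1:4 values of the example above.
boxes = np.array([[100, 200], [90, 270], [80, 320], [110, 220], [70, 280]])
print(target_aspect_ratios(boxes))
```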
In an embodiment, the method further comprises a step of enhancing the sample image; that is, acquiring the sample image includes: acquiring an original sample image; rotating the original sample image by a preset angle to obtain the sample image, and obtaining the real annotation information of the sample image according to the rotation angle of the rectangular bounding box in the original sample image and the preset angle; or performing vertical mirroring on the original sample image to obtain the sample image, and obtaining the real annotation information of the sample image according to the rotation angle of the rectangular bounding box in the original sample image; or performing horizontal mirroring on the original sample image to obtain the sample image, and obtaining the real annotation information of the sample image according to the rotation angle of the rectangular bounding box in the original sample image.
Specifically, the computer device may rotate the original sample image by a preset angle, for example 30° or 60°. The computer device may also perform vertical or horizontal mirroring on the original sample image, and may further mirror an image that has already been rotated by the preset angle, to obtain new sample images. The rotation angle in the annotation information of each processed sample image must be modified accordingly, so that the resulting new sample images can be added to the training sample library used to train the initial model.
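A sketch of how the rotation angle in the annotation may be updated for each enhancement, assuming θ is measured from the positive x-axis; under an image coordinate system the signs may differ, and the center coordinates of the bounding box must also be mirrored alongside the angle (not shown here).

```python
def rotate_label(theta: float, preset_angle: float) -> float:
    """Rotating the whole image by preset_angle shifts the box angle
    by the same amount (mod 360), per the first branch above."""
    return (theta + preset_angle) % 360.0

def vflip_label(theta: float) -> float:
    """Vertical mirroring (top-bottom flip) negates an angle measured
    from the x-axis."""
    return (-theta) % 360.0

def hflip_label(theta: float) -> float:
    """Horizontal mirroring (left-right flip) reflects an angle
    measured from the x-axis about the vertical axis."""
    return (180.0 - theta) % 360.0

# e.g. a box at 30 degrees, after rotating the image by 60 degrees:
assert rotate_label(30.0, 60.0) == 90.0
```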
FIG. 5 is a schematic diagram of enhancement processing performed on an original sample image to obtain sample images in one embodiment. Referring to fig. 5, enhanced sample images may be obtained by rotating the original sample image, mirroring it, and further mirroring the rotated picture.
In this embodiment, the richness of the sample image can be improved by performing image enhancement processing on the sample image, and the target detection model with higher accuracy and higher stability can be obtained by training the initial model by using the sample image.
S204, obtaining a feature map of the sample image through a feature extraction network of the initial model.
The feature map may be used to reflect the characteristics of the sample image. According to these characteristics, the target in the sample image can be located and its category classified. The initial model comprises a feature extraction network, a region generation network, a regression network, and a classification network. In the process of training the initial model, the computer device may input the sample image into the feature extraction network of the initial model, which extracts the image features of the sample image to obtain the feature map. The network parameters of the feature extraction network may be determined by prior training and are kept unchanged during the training process. The feature extraction network may be, for example, a convolutional neural network. In addition, the initial model may be built on the network architecture of Faster R-CNN.
S206, determining an initial detection frame in the feature map according to a preset rotation angle, a preset scale and a preset target aspect ratio through a region generation network of the initial model.
The region generation network extracts initial detection frames with rotation angles from the feature map. Specifically, for each position point on the feature map that belongs to the foreground, corresponding initial detection frames are generated according to the preset rotation angles, preset scales, and preset target aspect ratios. It will be appreciated that if there are m preset rotation angles, n preset scales, and k target aspect ratios, then by combination m×n×k initial detection frames can be generated for each position point.
For example, the preset rotation angles may comprise 8 angles {0°, 45°, 90°, 135°, 180°, 225°, 270°, 315°}, and the preset scales 3 scales {128×128, 256×256, 512×512}. The target aspect ratios are determined by statistics over the aspect ratios of the rectangular bounding boxes labeled in the sample images, i.e., they are obtained by analyzing the real aspect ratios of the targets, so the width and height of an initial detection frame determined from a target aspect ratio are closer to the real width and height of a target; this can accelerate the convergence of the detection network and improve accuracy. For example, with 3 target aspect ratios {w1:h1, w2:h2, w3:h3}, 8×3×3 = 72 different initial detection frames can be generated for each position point.
FIG. 6 is a schematic diagram of initial detection boxes determined from a feature map in one embodiment. Referring to fig. 6, the feature map has size S×S. For a position point (X0, Y0) on the feature map, the region generation network extracts 72 initial detection frames, of which fig. 6 shows only 6; their rotation angles, scales, and aspect ratios are respectively:
0°, 128×128, 1;
45°, 128×128, 1;
90°, 128×128, 1;
45°, 256×128, 2;
90°, 256×256, 1;
45°, 256×512, 1/2.
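A sketch of generating the 72 initial detection frames per foreground position point, using the 8 preset rotation angles and 3 preset scales listed above; the ratio values 1:2, 1:3, 1:4 stand in for the clustered w1:h1, w2:h2, w3:h3, and keeping each anchor's area at the square of its scale is one common convention, assumed here rather than specified by the application.

```python
import itertools

ANGLES = [0, 45, 90, 135, 180, 225, 270, 315]   # preset rotation angles
SCALES = [128, 256, 512]                        # sides of the preset scales
RATIOS = [(1, 2), (1, 3), (1, 4)]               # stand-ins for w1:h1 .. w3:h3

def initial_boxes(x0: float, y0: float) -> list:
    """All 8 * 3 * 3 = 72 initial detection frames centered at one
    foreground position point (x0, y0) of the feature map, each an
    (x, y, w, h, theta) tuple with area scale**2 and the requested
    w:h ratio."""
    boxes = []
    for theta, s, (rw, rh) in itertools.product(ANGLES, SCALES, RATIOS):
        w = s * (rw / rh) ** 0.5
        h = s * (rh / rw) ** 0.5
        boxes.append((x0, y0, w, h, float(theta)))
    return boxes

assert len(initial_boxes(0.0, 0.0)) == 72
```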
In one embodiment, for an input sample image, the pixels belonging to the foreground are obtained in the region generation network through a classification function, so that the position points on the feature map corresponding to foreground pixels are determined, and initial detection frames are generated for each determined position point.
S208, the position of each initial detection frame is adjusted through the regression network of the initial model, the position information of the prediction detection frame is obtained, and the network parameters of the regression network are adjusted according to the real position information in the labeling information and the position information of the prediction detection frame.
The regression network is used for adjusting the position of the generated initial detection frame according to the current network parameters, and obtaining the position information of the adjusted prediction detection frame. The position information of the prediction detection frame also comprises coordinates of a geometric center point of the prediction detection frame, a width and height of the prediction detection frame and a rotation angle. The initial detection frame generally cannot accurately locate the target in the sample image, and the position information of the initial detection frame is adjusted through the current network parameters in the regression network, so that the obtained prediction detection frame is closer to the target detection frame.
In one embodiment, adjusting the position of each initial detection frame to obtain the position information of the prediction detection frame includes: calculating the position offset of each initial detection frame according to the current network parameters of the regression network; obtaining the position information of the prediction detection frame according to the initial detection frame and the position offset; the position information includes coordinates of a geometric center point of the prediction detection frame, a width and height of the prediction detection frame, and a rotation angle of the prediction detection frame.
Following the above example, where the region generation network generates 72 initial detection frames for each foreground position point, the position information of each prediction detection frame obtained by adjusting the position of an initial detection frame comprises five values: the coordinates of the geometric center point, the width and height of the prediction detection frame, and the rotation angle, i.e., x, y, w, h, and θ. The regression network therefore has 72×5 = 360 output values for each point on the feature map.
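As an illustration of how position offsets could be applied, the following sketch uses a Faster R-CNN-style box transform extended with an additive angle term; the application does not fix the exact parameterization, so this transform is an assumption.

```python
import math

def decode(anchor, offsets):
    """Apply regression offsets (tx, ty, tw, th, tt) to an initial
    detection frame (x, y, w, h, theta) to obtain the prediction
    detection frame."""
    x, y, w, h, theta = anchor
    tx, ty, tw, th, tt = offsets
    px = x + tx * w                  # shift the center by fractions of w, h
    py = y + ty * h
    pw = w * math.exp(tw)            # rescale width and height
    ph = h * math.exp(th)
    ptheta = (theta + tt) % 360.0    # offset the rotation angle
    return (px, py, pw, ph, ptheta)

# e.g. nudging one of the 72 anchors at a feature-map point:
print(decode((25.0, 25.0, 128.0, 256.0, 45.0), (0.1, -0.05, 0.2, 0.0, 5.0)))
```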
In one embodiment, the training method of the object detection model further includes: determining the prediction detection frame according to the position information of the prediction detection frame; determining the rectangular bounding box corresponding to the target in the sample image according to the real position information; calculating the intersection ratio between the prediction detection frame and the rectangular bounding box; calculating the rotation angle difference between the prediction detection frame and the rectangular bounding box; when the intersection ratio is greater than a first threshold and the rotation angle difference is smaller than a second threshold, marking the sample image as a positive sample image; and when the intersection ratio is smaller than a third threshold or the rotation angle difference is larger than the second threshold, marking the sample image as a negative sample image.
The intersection ratio is the ratio of the overlapping area of the prediction detection frame and the real rectangular bounding box to their merged area. The overlapping area can be expressed by the number of position points contained in the overlap of the two boxes, and the merged area by the number of position points contained in their union. Because the position information output by the regression network includes the rotation angle of the prediction detection frame, the difference between that angle and the rotation angle of the real rectangular bounding box can be determined. The intersection ratio and the rotation angle difference reflect, to a certain extent, the accuracy of the prediction detection frame: the larger the intersection ratio, the higher the overlap between the two boxes, and the smaller the rotation angle difference, the closer their orientations. If the intersection ratio between the prediction detection frame and the real rectangular bounding box is greater than a first threshold and their rotation angle difference is smaller than a second threshold, the prediction detection frame is close to the real rectangular bounding box, and the sample image can be marked as a positive sample image. When the intersection ratio is smaller than a third threshold or the rotation angle difference is larger than the second threshold, the sample image is marked as a negative sample image. The first threshold may be 0.7, the second threshold 22.5°, and the third threshold 0.3. If the intersection ratio or rotation angle difference between the prediction detection frame and the real rectangular bounding box satisfies neither condition, the sample image belongs to neither the positive nor the negative samples and is not used for training.
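A sketch of this positive/negative screening for rotated boxes, using the shapely library to compute the polygon overlap; the geometric intersection ratio here approximates the position-point counting described above, and the helper names are assumptions.

```python
import math
from shapely.geometry import Polygon

def corners(box):
    """Corner points of a rotated box (x, y, w, h, theta_degrees)."""
    x, y, w, h, t = box
    c, s = math.cos(math.radians(t)), math.sin(math.radians(t))
    pts = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(x + px * c - py * s, y + px * s + py * c) for px, py in pts]

def label_sample(pred, truth, t1=0.7, t2=22.5, t3=0.3):
    """Positive / negative / unused labeling with the thresholds named
    above (first 0.7, second 22.5 degrees, third 0.3)."""
    p, g = Polygon(corners(pred)), Polygon(corners(truth))
    iou = p.intersection(g).area / p.union(g).area
    d = abs(pred[4] - truth[4]) % 360.0
    angle_diff = min(d, 360.0 - d)       # smallest circular difference
    if iou > t1 and angle_diff < t2:
        return "positive"
    if iou < t3 or angle_diff > t2:
        return "negative"
    return "unused"                      # neither positive nor negative

print(label_sample((50, 50, 100, 20, 45.0), (52, 50, 100, 22, 50.0)))
```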
In this embodiment, because there are many foreground position points, many initial detection frames are determined from them, and regression yields an equally large number of prediction detection frames. To reduce the amount of data in the training process, sample images may be screened according to the above method, and only the screened sample images are used to train the model.
In one embodiment, after the prediction detection frames are obtained, the computer device may also filter all prediction detection frames according to their degree of overlap (for example by non-maximum suppression) in order to reduce the amount of computation in the training process. The computer device may also cull prediction detection frames that extend beyond the image boundary.
Further, after the computer device obtains the position information of the target in the sample image, the computer device can adjust the network parameters of the regression network according to the difference between the real position information of the target in the labeling information and the position information of the prediction detection frame.
S210, predicting the prediction probability of the target corresponding to each preset category according to the target detection area determined by the position information of each prediction detection frame through the classification network of the initial model.
Specifically, the input of the classification network includes a feature map and the determined position information of the prediction detection frame, and the computer device may determine a target detection area from the feature map according to the position information of the prediction detection frame, and predict the class of the sample image based on the target detection area.
In one embodiment, predicting the prediction probability of the target corresponding to each preset category according to the target detection area determined by the position information of each prediction detection frame includes: determining the target detection areas on the feature map according to the position information of each prediction detection frame; adjusting each target detection area to the same preset scale and then obtaining the feature vector corresponding to each target detection area; and determining, from the feature vectors, the prediction probability of each target detection area corresponding to each preset category.
Specifically, the computer device may cut target detection areas of differing sizes out of the feature map according to the position information of each prediction detection frame, adjust each target detection area to the same preset scale through ROI pooling (region-of-interest pooling) to obtain the feature vector corresponding to each target detection area, and determine, through a fully connected layer and a normalization layer, the probability vector of each target detection area belonging to each preset category, thereby obtaining the prediction probability corresponding to each preset category.
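A sketch of the pooling-and-classification structure in PyTorch. Note that torchvision's stock roi_align handles only axis-aligned regions, so a real implementation of this application would need a rotated ROI pooling operator; the layer sizes, pooled 7x7 scale, and class count below are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class ClassificationHead(nn.Module):
    """Pools each target detection area to a fixed 7 x 7 scale, then
    maps the pooled feature vector to per-category probabilities via
    a fully connected layer and a softmax normalization layer."""

    def __init__(self, channels: int = 256, num_classes: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(channels * 7 * 7, 1024),
            nn.ReLU(),
            nn.Linear(1024, num_classes),
        )

    def forward(self, feature_map: torch.Tensor, rois: torch.Tensor):
        # rois: Tensor[K, 5] rows of (batch_index, x1, y1, x2, y2);
        # stock roi_align is axis-aligned only (see note above).
        pooled = roi_align(feature_map, rois, output_size=(7, 7))
        return torch.softmax(self.fc(pooled), dim=1)

# e.g. a (1, 256, 50, 50) feature map and two regions of interest:
head = ClassificationHead()
probs = head(torch.randn(1, 256, 50, 50),
             torch.tensor([[0, 0, 0, 20, 20], [0, 5, 5, 30, 30]],
                          dtype=torch.float))
print(probs.shape)  # torch.Size([2, 4])
```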
S212, adjusting network parameters of the classification network according to the real category information and the prediction probability in the labeling information, and obtaining a target detection model for target detection of the image.
Finally, after the class probabilities of the preset categories corresponding to the prediction detection frames in the sample image are determined, a loss function of the classification network can be constructed from these class probabilities and the real category information of the target in the sample image, and the network parameters of the classification network can be adjusted in the direction that minimizes the loss function. For all sample images, the computer device may repeat the above steps S202 to S212 on the current model until a target detection model capable of performing target detection on images is obtained.
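The application does not name a particular loss function; cross-entropy is one common concrete choice for comparing predicted class probabilities with the real category information, sketched here in PyTorch.

```python
import torch
import torch.nn.functional as F

def classification_loss(class_scores: torch.Tensor,
                        true_classes: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between the predicted class scores and the real
    category labels.  F.cross_entropy applies softmax internally, so
    it takes the pre-normalization scores (logits)."""
    return F.cross_entropy(class_scores, true_classes)

# The classification network's parameters are then adjusted along the
# direction that minimizes this loss, e.g. with a gradient step:
#   loss = classification_loss(scores, labels)
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
```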
According to the above training method of the target detection model, on the one hand, the annotation information of the sample image comprises real position information and real category information, and the real position information includes a rotation angle; the trained target detection model therefore has the ability to identify the rotation angle of a target in an image, so the located target detection frame is more accurate. On the other hand, during training, the rotation angles, scales, and target aspect ratios used by the region generation network to generate initial detection frames are initialized, which enriches the ways in which initial detection frames are generated and makes the trained model more stable; moreover, because the initial detection frames are determined according to preset rotation angles, they lie closer to the real target detection frame. In this way, the positions of the initial detection frames are adjusted through the regression network to obtain prediction detection frames, the target detection areas on the feature map are obtained from the prediction detection frames, the network parameters of the regression network are adjusted according to the real position information in the annotation information and the position information of the prediction detection frames, and, after the class probabilities of the target detection areas are predicted by the classification network, the network parameters of the classification network are adjusted according to the real category information in the annotation information and the prediction probabilities. A target detection model is thereby obtained that can detect rotated targets in images and locate targets more accurately.
In a specific embodiment, as shown in fig. 7, the training method of the target detection model includes the following steps:
s702, acquiring an original sample image.
S704, acquiring the width and height information of a rectangular bounding box corresponding to the target in each original sample image.
S706, the aspect ratio of each rectangular bounding box is counted according to the width and height information.
S708, clustering the counted aspect ratios to obtain target aspect ratios in the clustering result.
S710, judging whether the aspect ratio of the original sample image is 1; if yes, scaling the original sample image to a preset size in equal proportion to obtain a sample image; if not, the original sample image is scaled in equal proportion and then the image pixels are supplemented, and a sample image with a preset size is obtained.
S712, performing rotation processing on the sample image according to a preset angle to obtain a new sample image.
S714, performing vertical mirror image processing on the sample image to obtain a new sample image.
S716, performing horizontal mirroring on the sample image to obtain a new sample image.
S718, obtaining the real labeling information of the newly added sample image according to the rotation angle of the rectangular bounding box in the sample image.
S720, obtaining a feature map of the sample image through a feature extraction network of the initial model.
S722, determining an initial detection frame in the feature map according to the preset rotation angle, scale and target aspect ratio through the area generation network of the initial model.
S724, calculating the position offset of each initial detection frame according to the current network parameters of the regression network through the regression network of the initial model; obtaining the position information of the prediction detection frame according to the initial detection frame and the position offset; the position information includes coordinates of a geometric center point of the prediction detection frame, a width and height of the prediction detection frame, and a rotation angle of the prediction detection frame.
S726, adjusting network parameters of the regression network according to the real position information in the labeling information and the position information of the prediction detection frame.
S728, determining the prediction detection frame according to the position information of the prediction detection frame.
S730, determining a rectangular bounding box corresponding to the target in the sample image according to the real position information.
S732, calculating the intersection ratio and the rotation angle difference between the prediction detection frame and the rectangular bounding box.
And S734, when the intersection ratio is larger than the first threshold value and the rotation angle difference is smaller than the second threshold value, marking the sample image as a positive sample image.
And S736, when the intersection ratio is smaller than the third threshold value or the rotation angle difference is larger than the second threshold value, marking the sample image as a negative sample image.
S738, determining the target detection area on the feature map according to the position information of each prediction detection frame through the classification network of the initial model.
And S740, after the target detection areas are adjusted to the same preset scale, the feature vectors corresponding to the target detection areas are obtained.
S742, determining the prediction probability of each preset category corresponding to the target detection area according to the feature vector.
S744, the network parameters of the classification network are adjusted according to the real category information and the prediction probability in the labeling information, and then a target detection model for target detection of the image is obtained.
FIG. 7 is a flow chart of a training method of the object detection model in one embodiment. It should be understood that although the steps in the flowchart of fig. 7 are shown in the sequence indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated herein, the order of execution of these steps is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in fig. 7 may include multiple sub-steps or stages that are not necessarily performed at the same moment but may be performed at different times, and the order in which these sub-steps or stages are performed is not necessarily sequential; they may be performed in turn or alternately with at least a portion of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 8, a training apparatus 800 of a target detection model is provided, the apparatus including a sample image acquisition module 802, a feature map acquisition module 804, an initial detection frame generation module 806, a location regression module 808, and a classification module 810, wherein:
the sample image obtaining module 802 is configured to obtain a sample image and annotation information, where the annotation information includes real position information and real category information of a target in the sample image.
The feature map obtaining module 804 is configured to obtain a feature map of the sample image through a feature extraction network of the initial model.
The initial detection frame generation module 806 is configured to determine an initial detection frame in the feature map according to a preset rotation angle, a preset scale and a preset target aspect ratio through the area generation network of the initial model.
The position regression module 808 is configured to adjust the position of each initial detection frame through the regression network of the initial model, obtain the position information of the prediction detection frame, and adjust the network parameters of the regression network according to the real position information in the labeling information and the position information of the prediction detection frame.
The classification module 810 is configured to predict, according to the target detection area determined by the position information of each prediction detection frame, the prediction probability of the target corresponding to each preset category through the classification network of the initial model; and adjusting network parameters of the classification network according to the real category information and the prediction probability in the labeling information to obtain a target detection model for carrying out target detection on the image.
In one embodiment, the sample image acquisition module 802 is further configured to acquire an original sample image; judging whether the aspect ratio of the original sample image is 1; if yes, scaling the original sample image to a preset size in equal proportion to obtain a sample image; if not, the original sample image is scaled in equal proportion and then the image pixels are supplemented, and a sample image with a preset size is obtained.
In one embodiment, the sample image acquisition module 802 is further configured to acquire an original sample image; performing rotation processing on an original sample image according to a preset angle to obtain the sample image, and obtaining real labeling information of the sample image according to the rotation angle and the preset angle of a rectangular bounding box in the original sample image; or performing vertical mirror image processing on the original sample image to obtain a sample image, and obtaining real annotation information of the sample image according to the rotation angle of the rectangular bounding box in the original sample image; or performing horizontal mirroring on the original sample image to obtain a sample image, and obtaining the real annotation information of the sample image according to the rotation angle of the rectangular bounding box in the original sample image.
In one embodiment, the device further includes a statistics module, configured to obtain the sample images and the width and height information of the rectangular bounding box corresponding to the target in each sample image; compute the aspect ratio of each rectangular bounding box from the width and height information; and cluster the computed aspect ratios to obtain the target aspect ratios from the clustering result.
In one embodiment, the apparatus further includes a filtering module, configured to determine the prediction detection frame according to the position information of the prediction detection frame; determine the rectangular bounding box corresponding to the target in the sample image according to the real position information; calculate the intersection ratio between the prediction detection frame and the rectangular bounding box; calculate the rotation angle difference between the prediction detection frame and the rectangular bounding box; mark the sample image as a positive sample image when the intersection ratio is greater than the first threshold and the rotation angle difference is smaller than the second threshold; and mark the sample image as a negative sample image when the intersection ratio is smaller than the third threshold or the rotation angle difference is larger than the second threshold.
In one embodiment, the classification module is further configured to determine the target detection area on the feature map according to the position information of each prediction detection frame; scale each target detection area to the same preset scale and obtain the feature vector corresponding to each target detection area; and determine the prediction probability of each preset category corresponding to the target detection area according to the feature vector.
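The scale-alignment step is analogous to ROI pooling. The sketch below uses torchvision's axis-aligned roi_align purely as a stand-in; the patent's detection areas are rotated, which roi_align does not handle, so this is a deliberate simplification rather than the described method.

```python
# Pool each (axis-aligned, as a simplification) detection area to a fixed
# 7x7 grid and flatten it into one feature vector per area.
import torch
from torchvision.ops import roi_align

def roi_vectors(feature_map, boxes_xyxy, out_size=7):
    """feature_map: (1, C, H, W); boxes_xyxy: (N, 4) float boxes in feature-map coords."""
    idx = torch.zeros((boxes_xyxy.shape[0], 1))        # all boxes from image 0
    rois = torch.cat([idx, boxes_xyxy], dim=1)         # (N, 5): batch index + box
    pooled = roi_align(feature_map, rois, output_size=(out_size, out_size))
    return pooled.flatten(1)          # one fixed-length vector per detection area
```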
In the training device 800 for the target detection model, on the one hand, the annotation information of the sample image includes the real position information and the real category information, and the real position information includes the rotation angle; the trained target detection model therefore gains the ability to identify the rotation angle of the target in the image, and the located target detection frame is more accurate. On the other hand, during training, the rotation angles, scales, and target aspect ratios used by the region generation network to generate the initial detection frames are initialized, which enriches the ways the initial detection frames are generated and makes the trained model more stable; and because the initial detection frames are determined according to preset rotation angles, they are also closer to the real target detection frame. On this basis, the position of each initial detection frame is adjusted through the regression network to obtain the prediction detection frame, the target detection area on the feature map is obtained from the prediction detection frame, and the network parameters of the regression network are adjusted according to the real position information in the annotation information and the position information of the prediction detection frame; after the class probabilities of the target detection area are predicted through the classification network, the network parameters of the classification network are adjusted according to the real category information in the annotation information and the prediction probabilities. In this way, a target detection model that can detect rotated targets in an image and locate them more accurately is obtained.
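As a worked illustration of the offset-based position adjustment summarized above, the following sketch decodes a prediction detection frame (center, width, height, rotation angle) from an initial detection frame and the regressed offsets. The parameterization follows the common Faster R-CNN convention extended with an angle term; this is an assumption for illustration, not the patent's exact formula.

```python
# Decode a prediction detection frame from an initial detection frame plus
# regressed offsets (dx, dy, dw, dh, dtheta); the parameterization is assumed.
import math

def decode(anchor, offsets):
    """anchor: (cx, cy, w, h, theta); offsets: (dx, dy, dw, dh, dtheta)."""
    cx, cy, w, h, theta = anchor
    dx, dy, dw, dh, dtheta = offsets
    return (cx + dx * w,           # shift the geometric center point
            cy + dy * h,
            w * math.exp(dw),      # rescale width
            h * math.exp(dh),      # rescale height
            theta + dtheta)        # add the predicted rotation-angle offset
```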
FIG. 9 illustrates an internal block diagram of a computer device in one embodiment. The computer device may specifically be the computer device in FIG. 1. As shown in FIG. 9, the computer device includes a processor, a memory, and a network interface connected by a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program that, when executed by the processor, causes the processor to implement the training method of the target detection model. The internal memory may also store a computer program that, when executed by the processor, causes the processor to perform the training method of the target detection model.
It will be appreciated by those skilled in the art that the structure shown in FIG. 9 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the present application is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, the training apparatus of the target detection model provided in the present application may be implemented as a computer program, and the computer program may run on a computer device as shown in FIG. 9. The memory of the computer device may store the various program modules that make up the training apparatus of the target detection model, such as the sample image acquisition module 802, the feature map acquisition module 804, the initial detection frame generation module 806, the position regression module 808, and the classification module 810 shown in FIG. 8. The computer program constituted by these program modules causes the processor to execute the steps in the training method of the target detection model of each embodiment of the present application described in this specification.
For example, the computer device shown in FIG. 9 may perform step S202 through the sample image acquisition module 802 in the training apparatus of the target detection model shown in FIG. 8, perform step S204 through the feature map acquisition module 804, perform step S206 through the initial detection frame generation module 806, perform step S208 through the position regression module 808, and perform steps S210 and S212 through the classification module 810.
In one embodiment, a computer device is provided, including a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the training method of the target detection model described above. Here, the steps of the training method of the target detection model may be the steps in the training method of the target detection model of each of the above embodiments.
In one embodiment, a computer-readable storage medium is provided, storing a computer program that, when executed by a processor, causes the processor to perform the steps of the training method of the target detection model described above. Here, the steps of the training method of the target detection model may be the steps in the training method of the target detection model of each of the above embodiments.
Those skilled in the art will appreciate that all or part of the processes in the methods of the above embodiments may be implemented by a computer program instructing relevant hardware. The program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, it should be considered to be within the scope of this specification.
The above examples represent only a few embodiments of the present application, which are described specifically and in detail, but are not to be construed as limiting the scope of the present application. It should be noted that those skilled in the art may make various modifications and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Accordingly, the scope of protection of the present application shall be determined by the appended claims.

Claims (10)

1. A training method of a target detection model, comprising:
acquiring a sample image and annotation information, wherein the annotation information comprises real position information and real category information of a target in the sample image, and the real position information comprises a rotation angle of a rectangular bounding box corresponding to the target;
obtaining a feature map of the sample image through a feature extraction network of an initial model;
Determining an initial detection frame in the feature map according to a preset rotation angle, a preset scale and a preset target aspect ratio through a region generation network of the initial model;
adjusting the position of each initial detection frame through a regression network of the initial model to obtain position information of a prediction detection frame, comprising: calculating a position offset of each initial detection frame according to current network parameters of the regression network; and obtaining the position information of the prediction detection frame according to the initial detection frame and the position offset, wherein the position information comprises coordinates of a geometric center point of the prediction detection frame, a width and a height of the prediction detection frame, and a rotation angle of the prediction detection frame;
adjusting network parameters of the regression network according to the real position information in the annotation information and the position information of the prediction detection frame;
predicting, through a classification network of the initial model, a prediction probability of each preset category corresponding to the target according to a target detection area determined by the position information of each prediction detection frame, comprising: determining the target detection area on the feature map according to the position information of each prediction detection frame; after scaling each target detection area to the same preset scale, obtaining a feature vector corresponding to each target detection area; and determining the prediction probability of each preset category corresponding to the target detection area according to the feature vector;
and adjusting network parameters of the classification network according to the real category information in the annotation information and the prediction probability, to obtain a target detection model for performing target detection on an image.
2. The method of claim 1, wherein the acquiring a sample image comprises:
acquiring an original sample image;
determining whether the aspect ratio of the original sample image is 1; if so, scaling the original sample image proportionally to a preset size to obtain the sample image; if not, scaling the original sample image proportionally and then padding the image pixels to obtain a sample image of the preset size.
3. The method of claim 1, wherein the acquiring a sample image comprises:
acquiring an original sample image;
performing rotation processing on the original sample image according to a preset angle to obtain the sample image, and obtaining real annotation information of the sample image according to the rotation angle of a rectangular bounding box in the original sample image and the preset angle; or
performing vertical mirror processing on the original sample image to obtain the sample image, and obtaining real annotation information of the sample image according to the rotation angle of the rectangular bounding box in the original sample image; or
performing horizontal mirror processing on the original sample image to obtain the sample image, and obtaining real annotation information of the sample image according to the rotation angle of the rectangular bounding box in the original sample image.
4. The method of claim 1, wherein the step of determining the target aspect ratio comprises:
acquiring a sample image and width and height information of a rectangular bounding box corresponding to a target in each sample image;
counting the aspect ratio of each rectangular bounding box according to the width-height information;
and clustering the counted aspect ratios to obtain target aspect ratios in the clustering result.
5. The method of claim 1, wherein the regression network is configured to adjust the position of the generated initial detection frame based on the current network parameters to obtain the adjusted position information of the prediction detection frame.
6. The method according to claim 1, wherein the method further comprises:
determining the prediction detection frame according to the position information of the prediction detection frame;
determining a rectangular bounding box corresponding to the target in the sample image according to the real position information;
calculating an intersection-over-union ratio between the prediction detection frame and the rectangular bounding box;
calculating a rotation angle difference between the prediction detection frame and the rectangular bounding box;
marking the sample image as a positive sample image when the intersection-over-union ratio is greater than a first threshold and the rotation angle difference is smaller than a second threshold; and
marking the sample image as a negative sample image when the intersection-over-union ratio is smaller than a third threshold or the rotation angle difference is larger than the second threshold.
7. The method of claim 1, wherein an input to the classification network includes the feature map and the position information of the determined prediction detection frame.
8. A training apparatus for a target detection model, the apparatus comprising:
the sample image acquisition module is used for acquiring a sample image and annotation information, wherein the annotation information comprises real position information and real category information of a target in the sample image;
the feature map acquisition module is used for acquiring a feature map of the sample image through a feature extraction network of an initial model;
the initial detection frame generation module is used for generating a network through the area of the initial model and determining an initial detection frame in the feature map according to a preset rotation angle, a preset scale and a preset target aspect ratio;
a position regression module, configured to adjust the position of each initial detection frame through a regression network of the initial model to obtain position information of a prediction detection frame, including: calculating a position offset of each initial detection frame according to current network parameters of the regression network; and obtaining the position information of the prediction detection frame according to the initial detection frame and the position offset, wherein the position information comprises coordinates of a geometric center point of the prediction detection frame, a width and a height of the prediction detection frame, and a rotation angle of the prediction detection frame;
and adjusting network parameters of the regression network according to the real position information in the annotation information and the position information of the prediction detection frame;
the classification module is configured to predict, through a classification network of the initial model, the prediction probability of each preset category corresponding to the target according to the target detection area determined by the position information of each prediction detection frame, including: determining the target detection area on the feature map according to the position information of each prediction detection frame; after scaling each target detection area to the same preset scale, obtaining a feature vector corresponding to each target detection area; and determining the prediction probability of each preset category corresponding to the target detection area according to the feature vector;
and adjusting network parameters of the classification network according to the real category information in the annotation information and the prediction probability, to obtain a target detection model for performing target detection on an image.
9. A computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the method of any one of claims 1 to 7.
10. A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the method of any of claims 1 to 7.
CN201911422532.3A 2019-12-31 2019-12-31 Training method and device for target detection model, storage medium and computer equipment Active CN111241947B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911422532.3A CN111241947B (en) 2019-12-31 2019-12-31 Training method and device for target detection model, storage medium and computer equipment


Publications (2)

Publication Number Publication Date
CN111241947A CN111241947A (en) 2020-06-05
CN111241947B (en) 2023-07-18

Family

ID=70874291

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911422532.3A Active CN111241947B (en) 2019-12-31 2019-12-31 Training method and device for target detection model, storage medium and computer equipment

Country Status (1)

Country Link
CN (1) CN111241947B (en)

Families Citing this family (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111680680B (en) * 2020-06-09 2023-10-13 创新奇智(合肥)科技有限公司 Target code positioning method and device, electronic equipment and storage medium
CN111797737A (en) * 2020-06-22 2020-10-20 重庆高新区飞马创新研究院 Remote sensing target detection method and device
CN113836977B (en) * 2020-06-24 2024-02-23 顺丰科技有限公司 Target detection method, target detection device, electronic equipment and storage medium
CN111862001B (en) * 2020-06-28 2023-11-28 微医云(杭州)控股有限公司 Semi-automatic labeling method and device for CT images, electronic equipment and storage medium
CN113935386A (en) * 2020-06-29 2022-01-14 魔门塔(苏州)科技有限公司 Target detection method and device
CN112001247A (en) * 2020-07-17 2020-11-27 浙江大华技术股份有限公司 Multi-target detection method, equipment and storage device
CN111814905A (en) * 2020-07-23 2020-10-23 上海眼控科技股份有限公司 Target detection method, target detection device, computer equipment and storage medium
CN112149684A (en) * 2020-08-19 2020-12-29 北京豆牛网络科技有限公司 Image processing method and image preprocessing method for target detection
CN112052787B (en) * 2020-09-03 2021-07-30 腾讯科技(深圳)有限公司 Target detection method and device based on artificial intelligence and electronic equipment
CN111931864B (en) * 2020-09-17 2020-12-25 南京甄视智能科技有限公司 Method and system for multiple optimization of target detector based on vertex distance and cross-over ratio
CN112461130A (en) * 2020-11-16 2021-03-09 北京平恒智能科技有限公司 Positioning method for visual inspection tool frame of adhesive product
CN112464785A (en) * 2020-11-25 2021-03-09 浙江大华技术股份有限公司 Target detection method and device, computer equipment and storage medium
CN112489011B (en) * 2020-11-27 2023-01-31 上海航天控制技术研究所 Intelligent assembling and adjusting method for star sensor optical machine component
CN112418344B (en) * 2020-12-07 2023-11-21 汇纳科技股份有限公司 Training method, target detection method, medium and electronic equipment
CN112488118B (en) * 2020-12-18 2023-08-08 哈尔滨工业大学(深圳) Target detection method and related device
CN112799055A (en) * 2020-12-28 2021-05-14 深圳承泰科技有限公司 Method and device for detecting detected vehicle and electronic equipment
CN112613570A (en) * 2020-12-29 2021-04-06 深圳云天励飞技术股份有限公司 Image detection method, image detection device, equipment and storage medium
CN114757250A (en) * 2020-12-29 2022-07-15 华为云计算技术有限公司 Image processing method and related equipment
CN112686162B (en) * 2020-12-31 2023-12-15 鄂尔多斯市空港大数据运营有限公司 Method, device, equipment and storage medium for detecting clean state of warehouse environment
CN112801164B (en) * 2021-01-22 2024-02-13 北京百度网讯科技有限公司 Training method, device, equipment and storage medium of target detection model
CN112766418A (en) * 2021-03-02 2021-05-07 阳光财产保险股份有限公司 Image text direction classification method, device, equipment and storage medium
CN113128485A (en) * 2021-03-17 2021-07-16 北京达佳互联信息技术有限公司 Training method of text detection model, text detection method and device
CN113283345B (en) * 2021-05-27 2023-11-24 新东方教育科技集团有限公司 Blackboard writing behavior detection method, training device, medium and equipment
CN113313111B (en) * 2021-05-28 2024-02-13 北京百度网讯科技有限公司 Text recognition method, device, equipment and medium
CN113343853A (en) * 2021-06-08 2021-09-03 深圳格瑞健康管理有限公司 Intelligent screening method and device for child dental caries
CN113269188B (en) * 2021-06-17 2023-03-14 华南农业大学 Mark point and pixel coordinate detection method thereof
CN113748430A (en) * 2021-06-28 2021-12-03 商汤国际私人有限公司 Object detection network training and detection method, device, equipment and storage medium
CN113643323B (en) * 2021-08-20 2023-10-03 中国矿业大学 Target detection system under urban underground comprehensive pipe rack dust fog environment
CN113744213A (en) * 2021-08-23 2021-12-03 上海明略人工智能(集团)有限公司 Method and system for detecting regularity of food balance, computer equipment and storage medium
CN113920068B (en) * 2021-09-23 2022-12-30 北京医准智能科技有限公司 Body part detection method and device based on artificial intelligence and electronic equipment
CN114220063B (en) * 2021-11-17 2023-04-07 浙江大华技术股份有限公司 Target detection method and device
US11967137B2 (en) * 2021-12-02 2024-04-23 International Business Machines Corporation Object detection considering tendency of object location
CN114462469B (en) * 2021-12-20 2023-04-18 浙江大华技术股份有限公司 Training method of target detection model, target detection method and related device
CN115482417B (en) * 2022-09-29 2023-08-08 珠海视熙科技有限公司 Multi-target detection model, training method, device, medium and equipment thereof
CN115375917B (en) * 2022-10-25 2023-03-24 杭州华橙软件技术有限公司 Target edge feature extraction method, device, terminal and storage medium
CN116128954B (en) * 2022-12-30 2023-12-05 上海强仝智能科技有限公司 Commodity layout identification method, device and storage medium based on generation network
CN116862980B (en) * 2023-06-12 2024-01-23 上海玉贲智能科技有限公司 Target detection frame position optimization correction method, system, medium and terminal for image edge

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108376235A (en) * 2018-01-15 2018-08-07 深圳市易成自动驾驶技术有限公司 Image detecting method, device and computer readable storage medium
CN109815868B (en) * 2019-01-15 2022-02-01 腾讯科技(深圳)有限公司 Image target detection method and device and storage medium
CN109961040B (en) * 2019-03-20 2023-03-21 深圳市华付信息技术有限公司 Identity card area positioning method and device, computer equipment and storage medium
CN110232311B (en) * 2019-04-26 2023-11-14 平安科技(深圳)有限公司 Method and device for segmenting hand image and computer equipment
CN110097018A (en) * 2019-05-08 2019-08-06 深圳供电局有限公司 Converting station instrument detection method, device, computer equipment and storage medium
CN110298298B (en) * 2019-06-26 2022-03-08 北京市商汤科技开发有限公司 Target detection and target detection network training method, device and equipment

Also Published As

Publication number Publication date
CN111241947A (en) 2020-06-05

Similar Documents

Publication Publication Date Title
CN111241947B (en) Training method and device for target detection model, storage medium and computer equipment
CN110245662B (en) Detection model training method and device, computer equipment and storage medium
CN110569721B (en) Recognition model training method, image recognition method, device, equipment and medium
CN110414507B (en) License plate recognition method and device, computer equipment and storage medium
CN106875381B (en) Mobile phone shell defect detection method based on deep learning
CN110852285B (en) Object detection method and device, computer equipment and storage medium
CN111950329A (en) Target detection and model training method and device, computer equipment and storage medium
CN111444921A (en) Scratch defect detection method and device, computing equipment and storage medium
CN111079632A (en) Training method and device of text detection model, computer equipment and storage medium
CN107784288B (en) Iterative positioning type face detection method based on deep neural network
CN112989962B (en) Track generation method, track generation device, electronic equipment and storage medium
CN111552837A (en) Animal video tag automatic generation method based on deep learning, terminal and medium
WO2022134354A1 (en) Vehicle loss detection model training method and apparatus, vehicle loss detection method and apparatus, and device and medium
CN111368632A (en) Signature identification method and device
CN113706481A (en) Sperm quality detection method, sperm quality detection device, computer equipment and storage medium
CN110298302B (en) Human body target detection method and related equipment
CN113780145A (en) Sperm morphology detection method, sperm morphology detection device, computer equipment and storage medium
CN116485779B (en) Adaptive wafer defect detection method and device, electronic equipment and storage medium
CN113762027B (en) Abnormal behavior identification method, device, equipment and storage medium
US20230138821A1 (en) Inspection method for inspecting an object and machine vision system
CN112199984B (en) Target rapid detection method for large-scale remote sensing image
CN114783042A (en) Face recognition method, device, equipment and storage medium based on multiple moving targets
CN114005052A (en) Target detection method and device for panoramic image, computer equipment and storage medium
CN114419370A (en) Target image processing method and device, storage medium and electronic equipment
CN114399497A (en) Text image quality detection method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant