CN111738133A - Model training method, target detection method, device, electronic equipment and readable storage medium - Google Patents


Info

Publication number: CN111738133A
Application number: CN202010556608.8A
Authority: CN (China)
Prior art keywords: target, detection, feature map, model, image
Legal status: Pending
Other languages: Chinese (zh)
Inventor: 刘健
Current Assignee: Beijing QIYI Century Science and Technology Co Ltd
Original Assignee: Beijing QIYI Century Science and Technology Co Ltd
Application filed by Beijing QIYI Century Science and Technology Co Ltd


Classifications

    • G06V40/168 Feature extraction; face representation (human faces)
    • G06N20/00 Machine learning
    • G06N3/045 Combinations of networks (neural network architectures)
    • G06N3/08 Learning methods (neural networks)
    • G06V40/172 Classification, e.g. identification (human faces)
    • G06V2201/07 Target detection

Abstract

The embodiment of the invention provides a model training method and device, a target detection method and device, an electronic device, and a readable storage medium. The target detection method includes: obtaining an image to be detected and generating rectangular frames corresponding to the scales of preset anchor points from the image to be detected; inputting the image to be detected into a target lightweight base model to obtain basic feature maps corresponding to the preset anchor points; inputting the basic feature maps into a target lightweight detection model to obtain detection feature maps corresponding to the basic feature maps; obtaining a category confidence and a position for each rectangular frame from the detection feature maps; and obtaining a target detection result for a target object in the image to be detected from the category confidence and position of each rectangular frame, thereby achieving both a high target detection speed and high detection accuracy.

Description

Model training method, target detection method, device, electronic equipment and readable storage medium
Technical Field
The present invention relates to the field of machine learning technologies, and in particular, to a model training method, a target detection method, an apparatus, an electronic device, and a readable storage medium.
Background
Object detection is widely applied in many fields. For example, to identify a face or an object in a video, object detection is required to determine whether a face exists in the video content and locate its position, or to determine whether an object (such as a vehicle) exists in the video content and locate its position.
Current target detection methods cannot guarantee both a high detection speed and high accuracy. In the prior art, a VGG network is used for target detection. VGG stands for Visual Geometry Group, and a VGG network is a convolutional network model, whose names begin with "VGG", published by the Visual Geometry Group of the Department of Engineering Science at the University of Oxford. Although the VGG network can guarantee detection accuracy, its detection speed is slow.
Therefore, the existing target detection methods have the defect that a high detection speed and high detection accuracy cannot be guaranteed at the same time.
Disclosure of Invention
The embodiment of the invention aims to provide a model training method, a target detection method, a device, an electronic device, and a readable storage medium, so as to achieve a high target detection speed and high detection accuracy. The specific technical solutions are as follows:
in a first aspect of the present invention, there is provided a model training method, including:
obtaining a target image sample; wherein the target image sample comprises an image sample of at least one resolution;
training a preset detection model by using the target image sample to obtain a target detection model; wherein the preset detection model is built according to the scale of a preset anchor point; the preset detection model comprises a lightweight basic model and a lightweight detection model connected with the lightweight basic model; the target detection model comprises a target lightweight basic model and a target lightweight detection model; the target lightweight basic model is used for obtaining a feature map of the target image sample; and the target lightweight detection model is used for obtaining a target detection result of a target object in the target image sample according to the feature map.
In a second aspect of the present invention, there is also provided a target detection method, including:
acquiring an image to be detected, and generating a rectangular frame corresponding to the scale of a preset anchor point according to the image to be detected;
inputting the image to be detected into the target lightweight basic model to obtain a basic characteristic diagram corresponding to the preset anchor point;
and inputting the basic feature map into the target lightweight detection model to obtain a detection feature map corresponding to the basic feature map, obtaining a category confidence coefficient and a position corresponding to each rectangular frame according to the detection feature map, and obtaining a target detection result of a target object in the image to be detected according to the category confidence coefficient and the position corresponding to each rectangular frame.
In a third aspect of the present invention, there is also provided a model training apparatus, including:
a first obtaining module for obtaining a target image sample; wherein the target image sample comprises an image sample of at least one resolution;
the training module is used for training a preset detection model by using the target image sample to obtain a target detection model; wherein the preset detection model is built according to the scale of a preset anchor point; the preset detection model comprises a lightweight basic model and a lightweight detection model connected with the lightweight basic model; the target detection model comprises a target lightweight basic model and a target lightweight detection model; the target lightweight basic model is used for obtaining a feature map of the target image sample; and the target lightweight detection model is used for obtaining a target detection result of a target object in the target image sample according to the feature map.
In a fourth aspect of the present invention, there is also provided an object detection apparatus, comprising:
the first obtaining module is used for obtaining an image to be detected and generating a rectangular frame corresponding to the scale of a preset anchor point according to the image to be detected;
the second obtaining module is used for inputting the image to be detected into the target lightweight basic model to obtain a basic feature map corresponding to the preset anchor point;
and the third obtaining module is used for inputting the basic feature map into the target lightweight detection model to obtain a detection feature map corresponding to the basic feature map, obtaining the category confidence coefficient and the position corresponding to each rectangular frame according to the detection feature map, and obtaining a target detection result of the target object in the image to be detected according to the category confidence coefficient and the position corresponding to each rectangular frame.
In yet another aspect of the present invention, there is also provided a computer-readable storage medium having stored therein instructions, which when run on a computer, cause the computer to perform any one of the above-described model training methods or any one of the above-described object detection methods.
In yet another aspect of the present invention, there is also provided a computer program product comprising instructions which, when run on a computer, cause the computer to perform any of the above described model training methods or any of the above described object detection methods.
The target detection method provided by the embodiment of the invention includes: obtaining an image to be detected and generating rectangular frames corresponding to the scales of the preset anchor points from the image to be detected; inputting the image to be detected into the target lightweight basic model to obtain basic feature maps corresponding to the preset anchor points; inputting the basic feature maps into the target lightweight detection model to obtain detection feature maps corresponding to the basic feature maps; obtaining the category confidence and position corresponding to each rectangular frame from the detection feature maps; and obtaining the target detection result of the target object in the image to be detected from the category confidence and position of each rectangular frame. Because a lightweight basic model is used, the amount of parameter computation involved is small, so the target detection speed can be improved to a certain extent. Meanwhile, because the preset detection model is constructed according to the scales of the preset anchor points, the constructed preset detection model is suited to detecting targets close in size to those scales; and since the target detection model is obtained by training this constructed model, the resulting target detection model is all the more suited to detecting such targets, which ensures the detection accuracy to a certain extent.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
FIG. 1 is a flow chart illustrating steps of a model training method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a detection model provided in an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an SSH detection module provided in an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a context module in an SSH detection module according to an embodiment of the present invention;
FIG. 5 is a flowchart illustrating steps of a method for obtaining an image sample according to an embodiment of the present invention;
FIG. 6 is a flowchart illustrating steps of a method for detecting a target according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a model training apparatus according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of another model training apparatus provided in an embodiment of the present invention;
fig. 9 is a schematic structural diagram of an object detection apparatus according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of an electronic device provided in an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below with reference to the drawings in the embodiments of the present invention.
Referring to fig. 1, fig. 1 is a flowchart illustrating steps of a model training method according to an embodiment of the present invention. The method can be applied to a server or a terminal, and comprises the following steps:
step 101, obtaining a target image sample; wherein the target image sample comprises an image sample of at least one resolution.
For example, the target image sample includes image samples of three resolutions: a first image sample of a first resolution, a second image sample of a second resolution, and a third image sample of a third resolution, where the first resolution is greater than the second resolution, and the second resolution is greater than the third resolution.
The first resolution is a large resolution, for example 1080P (i.e., 1920 × 1080); the second resolution is a medium resolution, for example 720P (i.e., 1280 × 720); the third resolution is a small resolution, for example 576 × 288.
In embodiments of the present invention, the target image sample may include image samples of multiple resolutions. Alternatively, the target image sample may include image samples of a single resolution, for example any one of the first image sample, the second image sample, and the third image sample. The embodiment of the present invention does not specifically limit which resolutions of image samples the target image sample includes.
Step 102, training a preset detection model by using the target image sample to obtain a target detection model; the preset detection model is constructed according to the scale of the preset anchor point; it comprises a lightweight basic model and a lightweight detection model connected with the lightweight basic model; the target detection model comprises a target lightweight basic model and a target lightweight detection model; the target lightweight basic model is used for obtaining a feature map of the target image sample, and the target lightweight detection model is used for obtaining a target detection result of a target object in the target image sample according to the feature map.
In the embodiment of the invention, a plurality of preset anchor points (anchors) may be designed. The scale of an anchor represents the scale range of targets that the lightweight detection model can detect: the larger the anchor scale, the larger the targets that can be detected. For example, when the method is used to detect a human face (taking the target as a face), an anchor scale of 8 allows the target detection model to detect faces of about 8 × 8 pixels, and designing anchor scales of 8 and 16 allows the model to detect faces of about 8 × 8 pixels and faces of about 16 × 16 pixels. For face detection, six anchors may be designed: anchor1 = 8, anchor2 = 16, anchor3 = 32, anchor4 = 64, anchor5 = 128, and anchor6 = 256. (The number and scales of the anchors may be designed according to actual detection requirements; the embodiment of the present invention does not specifically limit them.)
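For illustration, the following Python sketch (not part of the original disclosure) tiles square candidate boxes of one anchor scale over an image; the grid stride and the centering convention are assumptions, since the patent does not specify them:

```python
import numpy as np

# the six example anchor scales from the text
ANCHOR_SCALES = [8, 16, 32, 64, 128, 256]

def generate_anchor_boxes(img_w, img_h, scale, stride):
    """Return an (N, 4) array of (x1, y1, x2, y2) boxes of size scale x scale,
    centered on a regular grid with the given stride (assumed)."""
    cx = np.arange(stride // 2, img_w, stride, dtype=np.float32)
    cy = np.arange(stride // 2, img_h, stride, dtype=np.float32)
    cx, cy = np.meshgrid(cx, cy)
    half = scale / 2.0
    return np.stack([cx - half, cy - half, cx + half, cy + half],
                    axis=-1).reshape(-1, 4)

boxes = generate_anchor_boxes(576, 288, scale=ANCHOR_SCALES[0], stride=8)
print(boxes.shape)  # (2592, 4): one 8 x 8 box per grid cell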
In the case of the above six anchors, the receptive fields corresponding to the scales of anchor1, anchor2, and anchor3 may be determined from those scales (in a convolutional neural network, the receptive field is the size of the region of the input picture onto which a pixel of the feature map output by each layer is mapped), so that the determined receptive fields are suitable for detecting faces of approximately the scales of anchor1, anchor2, and anchor3. Similarly, the receptive fields corresponding to the scales of anchor4 and anchor5 are determined from those scales, so that they are suitable for detecting faces of approximately those scales; and the receptive field corresponding to the scale of anchor6 is determined from that scale, so that it is suitable for detecting faces of approximately the scale of anchor6. Therefore, in the embodiment of the present invention, the convolutional layers to be included in the lightweight basic model and in the lightweight detection model may be constructed according to the determined receptive field sizes, so as to construct the preset detection model.
The constructed preset detection model is described here in connection with fig. 2, a schematic structural diagram of a detection model provided in an embodiment of the present invention, using the six anchors above as an example. The lightweight basic model consists of a first preset number of convolutional layers of a mobile network (MobileNet) model. As shown in fig. 2, the 1st to 13th convolutional layers of the MobileNet model serve as the lightweight basic model (i.e., the first preset number equals 13 in this example; the 1st convolutional layer is denoted convolutional layer 1, the 2nd convolutional layer 2, and so on up to convolutional layer 13). Note that which convolutional layers of the MobileNet model are selected as the first preset number may be determined by the size of the designed anchor scales; for example, if the maximum designed anchor scale equals 128, the first preset number of convolutional layers may include convolutional layers 1 through 11. The feature maps of the target image sample are obtained through this first preset number of convolutional layers.
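A minimal sketch of such a base model, assuming PyTorch; the channel widths and strides must be taken from Table 1 below, which is available only as an image, so none are hard-coded here:

```python
import torch.nn as nn

def conv_bn(cin, cout, stride):
    """Standard 3x3 convolution (like convolutional layer 1 in the text)."""
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, stride, 1, bias=False),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True))

def conv_dw(cin, cout, stride):
    """Depthwise separable convolution: a 3x3 depthwise convolution followed
    by a 1x1 pointwise convolution (like convolutional layers 2-13)."""
    return nn.Sequential(
        nn.Conv2d(cin, cin, 3, stride, 1, groups=cin, bias=False),
        nn.BatchNorm2d(cin),
        nn.ReLU(inplace=True),
        nn.Conv2d(cin, cout, 1, 1, 0, bias=False),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True))

def build_lightweight_base(widths, strides):
    """Stack the first len(widths) MobileNet-style layers as the base model;
    widths/strides should come from the patent's Table 1 (an image in the
    original publication), so this builder takes them as arguments."""
    layers = [conv_bn(3, widths[0], strides[0])]
    for i in range(1, len(widths)):
        layers.append(conv_dw(widths[i - 1], widths[i], strides[i]))
    return nn.Sequential(*layers)
```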
Convolutional layers 1 through 13 are described here with reference to table 1 below. Table 1 shows, for the case where the preset detection model is trained using the third image sample, the size of the feature map input to each convolutional layer, the type of each convolutional layer, the number of convolution kernels of each convolutional layer, and the stride of those kernels. The feature map input to convolutional layer i + 1 is the feature map output by convolutional layer i (i = 1, 2, 3, … 12). For example, table 1 shows the size of the feature map input to each convolutional layer of the lightweight basic model when training with image samples of 576 × 288 pixels: the feature map input to convolutional layer 6 is the feature map output by convolutional layer 5, whose size is 72 × 36 × 32, where 72, 36, and 32 denote the width, height, and number of channels of the feature map, respectively. Convolutional layer 1 is a standard convolutional layer with a 3 × 3 kernel; convolutional layer 13 has a 1 × 1 kernel; and each depthwise separable convolutional layer comprises a depthwise convolution (Depthwise Convolution) with a 3 × 3 kernel and a pointwise convolution (Pointwise Convolution) with a 1 × 1 kernel.
TABLE 1 (available in the original publication only as an image, labeled Figure BDA0002544542940000061; it lists, for each of convolutional layers 1 to 13, the input feature-map size, the layer type, the number of convolution kernels, and the kernel stride when training with the third image sample)
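Although Table 1 itself is only reproduced as an image, the spatial sizes quoted from it can be sanity-checked: each stride-2 (downsampling) layer halves the feature-map resolution. A quick check in Python, using the 3 downsamplings the text later attributes to convolutional layers 1 to 5:

```python
# A 576 x 288 input after 3 downsamplings yields the 72 x 36 feature map
# quoted above for convolutional layer 5.
def feature_map_size(width, height, num_downsamplings):
    factor = 2 ** num_downsamplings
    return width // factor, height // factor

print(feature_map_size(576, 288, 3))  # (72, 36)
```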
Optionally, the lightweight detection model includes the detection part of a Single Stage Headless (SSH) model (i.e., the SSH model excluding its base network) together with a second preset number of convolutional layers. The Nth convolutional layer of the first preset number of convolutional layers is connected to the second preset number of convolutional layers; the second preset number of convolutional layers is connected to a detection module in the lightweight detection model for detecting target objects of a preset scale; and N is smaller than the first preset number. For example, the second preset number equals 3: the 3 convolutional layers are connected to detection module 1, which is configured to detect target objects of a preset scale. The 5th convolutional layer of the first preset number (i.e., N = 5) is connected to the second preset number of convolutional layers, so the feature map output by the 5th convolutional layer is input to the 1st of the 3 convolutional layers, the output of the 1st is input to the 2nd, the output of the 2nd is input to the 3rd, and the output of the 3rd is input to detection module 1. When a human face is taken as the target object, detection module 1 is configured to detect faces of a preset scale. Continuing the example above, with anchor1 = 8, anchor2 = 16, and anchor3 = 32, the receptive field of the feature map output by detection module 1 is suitable for detecting faces of the preset scales (8 × 8, 16 × 16, and 32 × 32 pixels), which are smaller-scale target objects. That is, the receptive field of the feature map output by detection module 1 corresponds to the scales of anchor1, anchor2, and anchor3, and its size is determined in relation to those scales.
Note that convolutional layers 1 to 5 include 3 downsampling operations, convolutional layers 1 to 11 include 4, and convolutional layers 1 to 13 include 5. Therefore, the resolution of the feature map output by convolutional layer 5 is greater than that of the feature map output by convolutional layer 11. Because the feature map output by a lower (earlier) convolutional layer has a larger resolution, and its receptive field is suitable for detecting target objects of a smaller scale (for example, faces close in size to the scales of anchor1, anchor2, or anchor3), the embodiment of the present invention connects the Nth convolutional layer of the first preset number of convolutional layers to the second preset number of convolutional layers, and those layers to the detection module for target objects of the preset scale, in order to improve the detection accuracy of smaller-scale targets in the image. Smaller-scale target objects can thus be detected from the feature map output by a lower layer of the lightweight basic model; in fig. 2, convolutional layer 5 (a lower layer, i.e., N = 5) is used to detect faces of a smaller size (when the face is taken as the target object).
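The receptive-field figures used here and in the next paragraph (e.g., the 33 × 33 receptive field of convolutional layer 5) follow the standard recursion that the text attributes to the prior art. A sketch of that recursion, with a purely illustrative layer configuration, since Table 1's exact kernels and strides are not reproduced here:

```python
def receptive_field(layers):
    """Standard receptive-field recursion: each (kernel, stride) layer adds
    (kernel - 1) times the accumulated stride (the 'jump') to the field."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# Illustrative (kernel, stride) stand-ins for convolutional layers 1-5 with
# three downsamplings; the 33 x 33 value quoted in the next paragraph comes
# from the patent's actual configuration in table 1, which differs.
print(receptive_field([(3, 2), (3, 1), (3, 2), (3, 1), (3, 2)]))  # 27
```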
Meanwhile, because the semantic information (i.e., context information) of the feature map output by convolutional layer 5 is insufficiently expressive, and in order to improve the detection accuracy for target objects at the scales of anchor1, anchor2, and anchor3 (to which this feature map corresponds), the receptive field of the feature map output by convolutional layer 5 needs to be enlarged. (Using the receptive field calculation formula provided by the prior art together with the kernel sizes and strides of convolutional layers 1 to 5, the receptive field of convolutional layer 5's output can be calculated to equal 33 × 33.) The enlarged receptive field (that of the feature map output by detection module 1) should fall within a preset range set according to the scales of anchor1, anchor2, and anchor3; for example, since the maximum of those scales is 32, the preset range may be set to be greater than 32 × 32 and less than 64 × 64. At the same time, the preset range must not be set too large, so as to preserve the detection accuracy for small target objects. Therefore, in the embodiment of the present invention, the feature map output by convolutional layer 5 (of size 72 × 36 × 32) is input to the 1st convolutional layer of the lightweight detection model and convolved through the 1st to 3rd convolutional layers (all with 3 × 3 kernels, to balance detection accuracy and detection speed), which enlarges its receptive field; the feature map output by the 3rd convolutional layer (of size 72 × 36 × 128) is then input to detection module 1, which further enhances the semantic information and the receptive field, so that the receptive field of the feature map output by detection module 1 lies within the preset range and is suitable for detecting faces of approximately the scales of anchor1, anchor2, and anchor3 (for convenience, such faces are referred to below as smaller-scale faces). By enlarging the receptive field and enriching the semantic information of the feature map output by convolutional layer 5, the detection accuracy for smaller-scale faces can be further improved to a certain extent. Detection module 1 is an SSH detection module; referring to fig. 2 and taking the anchor scales and feature-map sizes of table 1 as an example, after the feature map output by convolutional layer 5 passes through the 1st to 3rd convolutional layers and detection module 1, the feature map output by detection module 1 has size 72 × 36 × 128 and a receptive field of 64 × 64. Referring to fig. 3 and fig. 4, fig. 3 is a schematic structural diagram of an SSH detection module provided in an embodiment of the present invention, and fig. 4 is a schematic structural diagram of a context module in the SSH detection module.
The convolutional layers shown in fig. 3 and fig. 4 may be depthwise separable convolutional layers. The SSH detection module includes a context module (Context Module), which may be used to add semantic information to the feature map, as shown in fig. 4. Alternatively, the convolutional layers shown in fig. 3 and fig. 4 may all be standard convolutional layers, or some may be standard convolutional layers while the others are depthwise separable convolutional layers; when depthwise separable convolutional layers are used, the amount of parameter computation is reduced and the detection model is made lighter. The splicing operation in fig. 3 concatenates the feature map output by the context module with the feature map output by the convolutional layer along the channel direction, and the splicing operation in fig. 4 concatenates the feature map output by convolutional layer 2 with the feature map output by convolutional layer 4 along the channel direction.
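A minimal PyTorch-style sketch of the structure that figs. 3 and 4 describe, assuming standard convolutions and illustrative channel widths (the actual widths, and the standard versus depthwise-separable choice, are per the text above):

```python
import torch
import torch.nn as nn

def conv3x3(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                         nn.ReLU(inplace=True))

class ContextModule(nn.Module):
    """Sketch of fig. 4: a shared 3x3 convolution followed by two branches;
    the branch outputs are spliced along the channel direction."""
    def __init__(self, cin, cout):
        super().__init__()
        self.conv1 = conv3x3(cin, cout // 2)
        self.conv2 = conv3x3(cout // 2, cout // 2)  # shorter branch
        self.conv3 = conv3x3(cout // 2, cout // 2)  # longer branch, part 1
        self.conv4 = conv3x3(cout // 2, cout // 2)  # longer branch, part 2
    def forward(self, x):
        y = self.conv1(x)
        return torch.cat([self.conv2(y), self.conv4(self.conv3(y))], dim=1)

class SSHDetectionModule(nn.Module):
    """Sketch of fig. 3: a 3x3 convolution branch spliced with the context
    module output along the channel direction."""
    def __init__(self, cin, cout=128):
        super().__init__()
        self.conv = conv3x3(cin, cout // 2)
        self.context = ContextModule(cin, cout // 2)
    def forward(self, x):
        return torch.cat([self.conv(x), self.context(x)], dim=1)

m = SSHDetectionModule(128)
print(m(torch.randn(1, 128, 36, 72)).shape)  # torch.Size([1, 128, 36, 72])
```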
Referring to fig. 2, the lightweight detection model may further include detection module 2 and detection module 3, each of which also comprises an SSH detection module. The feature map output by detection module 1 may be input to regression module 11 and classification module 12; the feature map output by detection module 2 to regression module 21 and classification module 22; and the feature map output by detection module 3 to regression module 31 and classification module 32 (the convolutional layers in the regression and classification modules may be implemented as depthwise separable convolutional layers, which further reduces the amount of parameter computation and further lightens the detection model). If the preset detection model is used to detect human faces, then with preset anchor points anchor1, anchor2, anchor3, anchor4, anchor5, and anchor6, the designed anchor1, anchor2, and anchor3 correspond to the feature map output by detection module 1; anchor4 and anchor5 correspond to the feature map output by detection module 2; and anchor6 corresponds to the feature map output by detection module 3. That is, detection module 1 detects faces at the scales of anchor1, anchor2, and anchor3; detection module 2 detects faces at the scales of anchor4 and anchor5 (hereinafter, faces close in size to the scales of anchor4 and anchor5 are referred to as medium-scale faces); and detection module 3 detects faces at the scale of anchor6 (hereinafter, faces close in size to the scale of anchor6 are referred to as large-scale faces). Referring to fig. 2, the feature map output by convolutional layer 11 of the lightweight basic model is input to detection module 2, and the feature map output by convolutional layer 13 is input to detection module 3.
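The regression and classification heads of fig. 2 can be sketched as follows (a hypothetical layout; the channel counts, the 1 × 1 kernels, and the two-class assumption are illustrative):

```python
import torch.nn as nn

class DetectionHead(nn.Module):
    """Per detection module: a regression branch predicting 4 box offsets
    per anchor and a classification branch predicting per-class confidences.
    Depthwise separable layers could be substituted to cut parameters,
    as the text notes."""
    def __init__(self, cin, num_anchors, num_classes=2):
        super().__init__()
        self.regression = nn.Conv2d(cin, num_anchors * 4, 1)
        self.classification = nn.Conv2d(cin, num_anchors * num_classes, 1)
    def forward(self, feat):
        return self.regression(feat), self.classification(feat)

# anchor scales handled by each detection module, per the text
ANCHORS_PER_MODULE = {1: [8, 16, 32], 2: [64, 128], 3: [256]}
```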
In the prior art, the SSH model is used for target object detection, and the base network in the SSH model is a VGG network. The VGG network is a large network comprising dozens of convolutional layers, so its parameter computation is heavy, which affects the target object detection speed. In the embodiment of the invention, the lightweight basic model is used as the base network of the preset detection model; its parameter computation is smaller than that of the VGG network, so the target object detection speed can be improved to a certain extent. Meanwhile, because the preset detection model is constructed according to the scales of the preset anchor points, the receptive fields and context information of the feature maps it produces correspond to those scales, making the feature maps suitable for detecting target objects close in size to the preset anchor scales.
It should be noted that depthwise separable convolutional layers are used for convolutional layers 2 to 13. The parameter computation of a depthwise separable convolutional layer is smaller than that of a standard convolutional layer, by roughly a factor of 8 to 9, so compared with the standard convolutional layers used in the prior-art VGG base network, the amount of parameter computation is reduced to some extent and the target object detection speed can be increased accordingly. Moreover, the preset detection model is constructed according to the scales of the preset anchor points, so it is suited to detecting target objects close in size to those scales, and the detection accuracy of the target objects can be ensured. Meanwhile, the second preset number of convolutional layers is connected to the detection module for target objects of the preset scale in the lightweight detection model, which enlarges the receptive field and semantic information of the feature map input to detection module 1 and further improves the detection accuracy of smaller target objects.
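The "factor of 8 to 9" figure can be sanity-checked with a quick parameter count (the channel widths here are illustrative):

```python
# Parameter counts for one 3x3 convolution with C_in inputs and C_out outputs.
def standard_params(cin, cout, k=3):
    return k * k * cin * cout

def depthwise_separable_params(cin, cout, k=3):
    return k * k * cin + cin * cout  # depthwise 3x3 + pointwise 1x1

cin, cout = 128, 128
ratio = standard_params(cin, cout) / depthwise_separable_params(cin, cout)
print(round(ratio, 2))  # ~8.41, i.e. between 8x and 9x for typical widths
```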
And training the preset detection model by adopting the target image sample to obtain a target detection model.
The target detection model comprises a target lightweight basic model and a target lightweight detection model.
Training the preset detection model with the target image sample to obtain the target detection model may be carried out in either of the following ways:
The first method is as follows: when the target image sample includes image samples of one resolution, the preset detection model is trained with the target image sample to obtain the target detection model.
The second method comprises the following steps: under the condition that the target image samples comprise image samples with various resolutions, training the preset detection model by adopting a first target image sample to obtain a first detection model, wherein the first target image sample is an image sample with any one resolution in the target image samples;
training the first detection model by adopting a second target image sample to obtain a second detection model, wherein the second target image sample is an image sample with any resolution in image samples except the first target image sample in the target image sample;
taking the second detection model as the target detection model in a case where there is no image sample of a resolution other than the first target image sample and the second target image sample in the target image samples;
and under the condition that the image samples with other resolutions exist in the target image sample, obtaining the target detection model according to the second detection model and the image samples with other resolutions.
For example, when the target image sample includes image samples of two resolutions, namely, the first image sample and the second image sample, the preset detection model is trained by using the first target image sample to obtain a first detection model, where the first target image sample is an image sample of any one resolution in the target image samples; and training the first detection model by adopting a second target image sample to obtain a second detection model, wherein the second target image sample is an image sample except the first target image sample in the target image samples. In this case, the second detection model is used as the target detection model because no image sample with other resolution than the first target image sample and the second target image sample exists in the target image samples.
And obtaining the target detection model according to the second detection model and the image samples with other resolutions under the condition that the image samples with other resolutions exist in the target image sample. For example, when a target image sample includes the first image sample, the second image sample, and the third image sample, training the preset detection model using the first target image sample to obtain a first detection model, where the first target image sample is an image sample with any resolution in the target image samples; training the first detection model by adopting a second target image sample to obtain a second detection model, wherein the second target image sample is an image sample with any resolution in image samples except the first target image sample in the target image sample; and training the second detection model by adopting a third target image sample to obtain the target detection model, wherein the third target image sample is an image sample except the first target image sample and the second target image sample in the target image samples.
It should be noted that, in the second mode, training the first detection model with the second target image sample to obtain the target detection model is a fine-tuning (fine tune) process: fine-tuning takes an already-trained model and continues training it with image samples of other resolutions to obtain a new model (for example, the new model obtained by training the first detection model with the second target image sample). Likewise, training the second detection model with a third target image sample to obtain the target detection model is also fine-tuning. The advantage of fine-tuning is that the model does not have to be retrained from scratch, which improves training efficiency, and fine-tuning yields a model with better detection performance (both detection accuracy and detection speed) after a smaller number of iterations.
When the preset detection model is trained with image samples of multiple resolutions, it may first be trained with the highest-resolution image samples to obtain detection model A, and detection model A may then be trained with the next-highest-resolution image samples to obtain detection model B, and so on until all the image samples have been used. For example, when the target image sample includes the first, second, and third image samples, the preset detection model is trained with the first image sample to obtain detection model A, detection model A is trained with the second image sample to obtain detection model B, and detection model B is trained with the third image sample to obtain the target detection model. Alternatively, the preset detection model may first be trained to convergence with the second image sample to obtain detection model 1, detection model 1 then trained to convergence with the first image sample to obtain detection model 2, and detection model 2 then trained with the third image sample to obtain the target detection model. The embodiment of the present invention does not specifically limit the order in which the image samples of the various resolutions are used for training.
It should be noted that when the preset detection model is first trained with the highest-resolution image samples, the accuracy of the converged detection model A is higher; when detection model A is subsequently fine-tuned with the next-highest-resolution image samples, the accuracy of the resulting detection model B is also higher; and finally detection model B is fine-tuned with the third image sample. Because the resolution of the third image sample is lower, the detection speed is higher, which further improves the detection speed of the final target detection model while keeping its detection accuracy high.
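A sketch of this coarse-to-fine schedule; `train_until_converged` is a hypothetical helper standing in for a full training loop (one such loop is sketched further below):

```python
def train_until_converged(model, dataset):
    # placeholder for a full training loop; see the sketch further below
    return model

def multi_resolution_training(model, samples_by_resolution):
    """Train on the highest resolution first, then fine-tune the resulting
    model on each lower resolution in turn, e.g.
    {(1920, 1080): ds1, (1280, 720): ds2, (576, 288): ds3}."""
    for resolution in sorted(samples_by_resolution, reverse=True):
        model = train_until_converged(model, samples_by_resolution[resolution])
    return model
```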
Optionally, before the step 102 of training a preset detection model by using the target image sample to obtain a target detection model, the method may further include the following steps:
obtaining a target classification model, wherein the target classification model is obtained by training a classification model by adopting a preset image sample;
and initializing a lightweight basic model according to the target classification model.
The preset image samples are samples from the ImageNet image dataset; ImageNet is an image dataset organized according to the WordNet hierarchy.
Correspondingly, step 102, training a preset detection model with the target image sample to obtain a target detection model, can be realized by the following step:
and training a preset detection model comprising the initialized lightweight basic model by adopting the target image sample to obtain the target detection model.
Training the preset detection model comprising the initialized lightweight basic model with the target image sample to obtain the target detection model may be carried out in either of the following ways:
The first method is as follows: when the target image sample includes image samples of one resolution, the preset detection model comprising the initialized lightweight basic model is trained (fine-tuned) with the target image sample to obtain the target detection model.
The second method comprises the following steps: under the condition that a target image sample comprises image samples with various resolutions, training a preset detection model comprising an initialized lightweight basic model by adopting a fourth target image sample to obtain a third detection model, wherein the fourth target image sample is an image sample with any one resolution in the target image sample;
training the third detection model by adopting a fifth target image sample to obtain a fourth detection model, wherein the fifth target image sample is an image sample with any resolution in image samples except the fourth target image sample in the target image samples;
taking the fourth detection model as the target detection model in a case where there is no image sample of a resolution other than the fourth target image sample and the fifth target image sample in the target image samples;
and under the condition that the image samples with other resolutions exist in the target image sample, obtaining the target detection model according to the fourth detection model and the image samples with other resolutions.
For example, when the target image sample includes image samples of two resolutions, that is, the first image sample and the second image sample, a fourth target image sample is used to train a preset detection model including an initialized lightweight basic model, so as to obtain a third detection model, where the fourth target image sample is an image sample of any one resolution in the target image samples, for example, the first image sample is used as the fourth target image sample;
and training the third detection model by using a fifth target image sample to obtain a fourth detection model, wherein the fifth target image sample is an image sample except for the fourth target image sample in the target image samples, and for example, the second image sample is used as the fifth target image sample. In this case, the fourth detection model is used as the target detection model because no image sample with a resolution other than the fourth target image sample and the fifth target image sample exists in the target image samples.
And obtaining the target detection model according to the fourth detection model and the image samples with other resolutions under the condition that the image samples with other resolutions exist in the target image sample. For example, when the target image sample includes a first image sample, a second image sample, and a third image sample (in this case, the target image sample includes image samples of the other resolutions), a fourth target image sample is used to train a preset detection model including the initialized lightweight basic model, so as to obtain a third detection model, where the fourth target image sample is an image sample of any one resolution in the target image samples;
training a third detection model by adopting a fifth target image sample to obtain a fourth detection model, wherein the fifth target image sample is an image sample with any resolution in image samples except the fourth target image sample in the target image sample;
and training a fourth detection model by using a sixth target image sample (the sixth target image sample is an image sample with other resolutions except for the fourth target image sample and the fifth target image sample existing in the target sample), so as to obtain a target detection model, namely obtaining the target detection model according to the second detection model and the image samples with other resolutions. And the sixth target image sample is an image sample except the fourth target image sample and the fifth target image sample in the target image samples. For example, the first image sample is taken as the fourth target image sample, the second image sample is taken as the fifth target image sample, and the third image sample is taken as the sixth target image sample.
Training the preset detection model comprising the initialized lightweight basic model with the target image sample, i.e., fine-tuning the preset detection model on the basis of the already-trained lightweight basic model, can further improve the detection accuracy of the target detection model.
The above describes the overall model training process of the embodiment of the present invention. The following describes in detail the process of training a preset detection model with a particular image sample, taking as an example training the preset detection model with the first image sample until convergence to obtain detection model A. The preset detection model here may be the initially constructed preset detection model or a preset detection model comprising the initialized lightweight basic model. The step of training the preset detection model with the first image sample includes:
generating a plurality of sample rectangular frames corresponding to the scale of the preset anchor point according to the first image sample;
inputting the first image sample into a preset detection model to obtain a category confidence coefficient and a position corresponding to each sample rectangular frame, wherein the category confidence coefficient is used for representing the probability that a target object in the sample rectangular frame belongs to a category corresponding to the target object;
calculating a loss value according to the class confidence and the position corresponding to each sample rectangular frame and a sample label corresponding to a target rectangular frame in the first image sample, wherein the sample label is used for representing the position of the target rectangular frame and the class corresponding to a target object in the target rectangular frame;
and adjusting parameters of the preset detection model according to the loss value, inputting the first image sample to the adjusted preset detection model again until the loss value obtained by calculation is not reduced any more, and taking the finally obtained preset detection model as a detection model A.
In the embodiment of the present invention, the target object in the first image sample corresponds to tag information, where the tag information may include coordinates of a target area of the target object in the first image sample, a category corresponding to each target object, and the like, for example, if it is required that the target detection model can identify a face and a position of the face in the image, the target is the face, the target area is an area where the face is located, the category corresponding to the target object is the face, and the tag information may include coordinates of the area where the face is located and a category corresponding to the face. The coordinates of the target area include, for example, the coordinates of the upper left corner and the lower right corner of the target area. If the target detection model is required to be capable of identifying the position of the vehicle at the same time, the target object further comprises the vehicle, the target area is the area where the vehicle is located, and the category corresponding to the target object is the vehicle.
And generating a plurality of sample rectangular frames corresponding to the scale of the preset anchor point according to the first image sample. For example, when the preset anchor includes the above-described 6 anchors, a plurality of sample rectangular frames 1 corresponding to the scale of the anchor1 may be generated, a plurality of sample rectangular frames 2 corresponding to the scale of the anchor2 may be generated, and so on, a plurality of sample rectangular frames 6 corresponding to the scale of the anchor6 may be generated.
The first image sample is input into the preset detection model to obtain the category confidence and position corresponding to each sample rectangular frame, and a loss value is calculated from these together with the sample labels corresponding to the target rectangular frames in the first image sample. The loss value comprises a classification loss value and a regression loss value, whose respective weights may be set as required; for example, if the trained preset detection model needs higher accuracy on the position of the target object, the weight assigned to the regression loss may be set greater than the weight assigned to the classification loss value. A target rectangular frame is an annotated rectangular frame in the first image sample that contains the target object.
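The weighting just described is a simple weighted sum; the weight values below are illustrative:

```python
def total_loss(classification_loss, regression_loss, w_cls=1.0, w_reg=2.0):
    """Weighted combination of the two loss terms; w_reg > w_cls emphasizes
    localization accuracy, per the example above. Values are illustrative."""
    return w_cls * classification_loss + w_reg * regression_loss
```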
It should be noted that positive and negative samples corresponding to each target rectangular frame in the first image sample may be selected from all the sample rectangular frames; the regression loss value is calculated from the positions of the positive samples, the classification loss value from the category confidences of the positive and negative samples, and the final loss value from the regression loss value, the classification loss value, and their respective weights. An example of selecting the positive and negative samples corresponding to a target rectangular frame from all sample rectangular frames: suppose the target rectangular frames include target rectangular frame 1 and there are, say, 1000 sample rectangular frames. The selection proceeds as follows. The Intersection over Union (IoU) of each sample rectangular frame with target rectangular frame 1 is calculated, where the IoU is the ratio of the intersection to the union of the sample rectangular frame and the target rectangular frame; it can be computed from the position of each sample rectangular frame and the position of target rectangular frame 1. The maximum IoU is then selected. If the maximum IoU is greater than or equal to a first preset threshold (e.g., 0.5), the sample rectangular frame with the maximum IoU is taken as a positive sample, a preset number of IoUs are randomly selected from among those smaller than a second preset threshold (e.g., 0.2), and the corresponding sample rectangular frames are taken as negative samples. If the maximum IoU is smaller than the first preset threshold, negative samples are randomly selected from all sample rectangular frames whose IoU is smaller than the second preset threshold (e.g., 0.2). If, for example, the ratio of the number of positive samples to negative samples is 1:3, then 3 sample rectangular frames with IoU smaller than 0.2 are selected as negative samples. If the target rectangular frames also include target rectangular frame 2, its positive and negative samples may be selected by the same process.
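A sketch of the IoU computation and the selection rule just described; the thresholds and the 1:3 ratio follow the examples above, and all helper names are mine:

```python
import random

def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def select_samples(sample_boxes, target_box,
                   pos_thresh=0.5, neg_thresh=0.2, neg_ratio=3):
    """Pick positive/negative sample boxes for one target rectangular frame;
    the negative count when no positive exists is an assumption."""
    ious = [iou(b, target_box) for b in sample_boxes]
    best = max(range(len(ious)), key=ious.__getitem__)
    positives = [sample_boxes[best]] if ious[best] >= pos_thresh else []
    candidates = [b for b, v in zip(sample_boxes, ious) if v < neg_thresh]
    n_neg = neg_ratio * max(len(positives), 1)
    negatives = random.sample(candidates, min(len(candidates), n_neg))
    return positives, negatives
```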
Because the IoU is calculated between every sample rectangular frame and the target rectangular frame, the sample rectangular frame with the largest IoU is regarded as the target sample rectangular frame corresponding to that target rectangular frame, and regression is subsequently performed on the target sample rectangular frame so that the position output by the finally trained detection model approaches the target rectangular frame. This ensures that each target sample rectangular frame corresponds to one target rectangular frame and each target rectangular frame corresponds to one target sample rectangular frame, so the preset detection model can regress each target rectangular frame more easily and converges better, and the resulting detection model A converges better. Fine-tuning on the basis of the trained detection model A then makes the final target detection model converge better still, which can further improve the detection accuracy of the target object.
Optionally, referring to fig. 5, fig. 5 is a flowchart illustrating steps of an image sample obtaining method according to an embodiment of the present invention. Obtaining a target image sample by:
Step 501, obtaining an original image.
Step 502, judging whether the original image needs to be cropped.
If the original image needs to be cropped, step 503 is performed; if the original image does not need to be cropped, step 504 is performed.
Step 503, determining a center point in the target area, and determining a cropped image centered on that point.
The width of the cropped image is equal to any multiple within a preset multiple range of the width of the original image, and the height of the cropped image is equal to any multiple within a preset multiple range of the height of the original image. The target area is the area of the first original target image remaining after the boundary area is excluded, and the boundary area is the area within a preset distance of the peripheral boundary of the first original target image.
For example, the preset multiple range is set to be greater than or equal to 0.3 and less than 1. If the size of the original image is 1920 × 1080, the width of the cropped image may be 0.3 times the width of the original image and the height 0.4 times the height of the original image, giving a cropped image of size (1920 × 0.3) × (1080 × 0.4); alternatively, the width may be 0.5 times and the height 0.6 times the original, giving a cropped image of size (1920 × 0.5) × (1080 × 0.6). Because the width and height of the cropped image are each an arbitrary multiple within the preset multiple range, repeated cropping yields cropped images of varied widths and heights, so the finally obtained image samples are diversified.
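A minimal sketch of this crop-size selection, assuming the preset multiple range [0.3, 1) from the example; the function name is hypothetical.

```python
import random

def random_crop_size(orig_w, orig_h, low=0.3, high=1.0):
    """Draw the width and height multiples independently, so repeated
    crops of the same original vary in both dimensions."""
    return (int(orig_w * random.uniform(low, high)),
            int(orig_h * random.uniform(low, high)))

# For a 1920x1080 original, one call might return (576, 432),
# i.e. 0.3x the width and 0.4x the height, as in the example above.
print(random_crop_size(1920, 1080))
```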
Step 505 is executed after step 503 is executed.
Step 504, judging whether color conversion needs to be performed on the original image.
If color conversion of the original image is required, step 508 is performed; if not, step 509 is performed.
Step 505, judging whether color conversion needs to be performed on the cropped image.
If color conversion of the cropped image is required, step 506 is performed; if not, step 507 is performed.
Step 506, performing color conversion on the cropped image, and judging, according to the size of the color-converted cropped image, whether to adjust its size so as to obtain a target image sample.
Step 507, judging, according to the size of the cropped image, whether to adjust its size so as to obtain a target image sample.
Step 508, performing color conversion on the original image, and judging, according to the size of the color-converted original image, whether to adjust its size so as to obtain a target image sample.
Step 509, judging, according to the size of the original image, whether to adjust its size so as to obtain a target image sample.
It should be noted that whether to crop or color-convert the original image may be determined according to a random value. For example, when determining whether to crop the original image, if the random value is odd, the original image is cropped; if the random value is even, the original image is not cropped.
Color-converting an image (either the original image or the cropped image) may include color conversion and channel-order conversion. For example, channel-order conversion converts the channel order from RGB to BGR or GRB, where R, G, and B denote the red, green, and blue channels of the RGB color mode. Whether to perform color conversion or channel-order conversion is likewise decided randomly. Through the above steps, each obtained image sample is diversified; training the preset detection model with diversified image samples improves the detection precision of the finally obtained target detection model.
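The random decision and the channel-order conversion might look like the following sketch; the assumption of an H × W × 3 NumPy array in RGB order and the function name are both illustrative.

```python
import random
import numpy as np

def maybe_channel_convert(image):
    """Randomly convert the channel order of an H x W x 3 RGB image to
    BGR or GRB; whether to convert at all is itself a random decision,
    mirroring the odd/even random-value rule described above."""
    if random.getrandbits(1) == 1:
        order = random.choice([[2, 1, 0],   # RGB -> BGR
                               [1, 0, 2]])  # RGB -> GRB
        image = image[:, :, order]
    return image
```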
Based on steps 501 to 509, suppose the target image sample includes the first image sample, that is, the first image sample needs to be obtained; an image sample with a resolution of 1920 × 1080 is taken as an example. If the resolution of the obtained original image 1 is 1920 × 1080 and, after the above steps, original image 1 is neither cropped nor color-converted, original image 1 may be taken directly as the first image sample. If the resolution of the obtained original image 2 is 1280 × 720 and original image 2 is neither cropped nor color-converted after the above steps, its size may be adjusted (resized) to enlarge it to a 1920 × 1080 image, thereby obtaining the first image sample. Similarly, if original image 1 is cropped and color-converted to obtain a color-converted cropped image, the cropped image may be resized to obtain the first image sample (i.e., the target image sample) with a resolution of 1920 × 1080.
If a first image sample and a second image sample need to be obtained, that is, the target image sample comprises the first image sample and the second image sample: if original image 1 is neither cropped nor color-converted, original image 1 may be taken as the first image sample; the size of original image 1 is then adjusted to 1280 × 720 to obtain the second image sample. If the target image sample comprises a first image sample, a second image sample, and a third image sample: if original image 1 is neither cropped nor color-converted, original image 1 may be taken as the first image sample; its size is then adjusted to 1280 × 720 to obtain the second image sample, and adjusted to 576 × 288 to obtain the third image sample. The target image sample can thus be obtained according to steps 501 to 509.
It should be noted that if the target image sample includes the first image sample and the second image sample, after the first image sample is obtained, steps 501 to 509 may be repeated to obtain the second image sample, that is, an image sample with a resolution of 1280 × 720. For example, if the resolution of the obtained original image 1 is 1920 × 1080 and original image 1 is neither cropped nor color-converted after the above steps, it may be determined, according to its size, that no resizing is needed, and original image 1 is taken as the first image sample. After the first image sample is obtained, another frame of original image may be obtained (this may be original image 1 or a different original image), and steps 501 to 509 are repeated to determine whether that frame needs cropping and color conversion; if the frame is 576 × 288 and is neither cropped nor color-converted, it may be resized to 1280 × 720 and used as the second image sample. For brevity, the full process of obtaining the target image sample is not described in detail here. Repeating steps 501 to 509 makes each obtained target image sample more diverse.
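The resolution-adjustment step can be sketched as follows, assuming OpenCV is used for resizing; the patent does not prescribe a particular library, and the names here are illustrative.

```python
import cv2

# Resolutions of the first, second, and third image samples in the example.
TARGET_SIZES = [(1920, 1080), (1280, 720), (576, 288)]

def make_image_samples(image):
    """Resize one (possibly cropped and color-converted) image to each
    target resolution; cv2.resize takes its size as (width, height)."""
    return [cv2.resize(image, size) for size in TARGET_SIZES]
```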
It should be noted that if the original image is cropped, for example the original image includes 10 target regions and the cropped image contains 5 complete target regions plus a remaining region whose other part was cropped away, the ratio between the remaining region and the target region before cropping may be determined; if the ratio is smaller than a preset threshold (for example, 0.8), the remaining region is not used as a target rectangular frame. For example, if a remaining region in the cropped image contains only a person's chin and its ratio to the pre-cropping target region is less than 0.8, the remaining region is not taken as a target rectangular frame; that is, no IoU is calculated between the sample rectangular frames and that region, so no positive sample is selected for it. Taking an excessively small region as a target region would increase the difficulty of training the detection model, so excluding such regions makes it easier for the detection model to regress.
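A small sketch of this filtering rule, under the assumption that both the target region and the crop window are axis-aligned (x1, y1, x2, y2) rectangles; the function name and the 0.8 default mirror the example above.

```python
def is_valid_target(region, crop, min_ratio=0.8):
    """Keep a partially cropped target region as a target rectangular
    frame only if the surviving fraction of its area is at least
    `min_ratio`; the surviving part is the intersection with the crop."""
    ix1, iy1 = max(region[0], crop[0]), max(region[1], crop[1])
    ix2, iy2 = min(region[2], crop[2]), min(region[3], crop[3])
    surviving = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    original = (region[2] - region[0]) * (region[3] - region[1])
    return surviving / original >= min_ratio
```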
Referring to fig. 6, fig. 6 is a flowchart illustrating steps of a target detection method provided in an embodiment of the present invention, where the method may be applied to a terminal or a server, and the method includes:
Step 601, obtaining an image to be detected, and generating a rectangular frame corresponding to the scale of the preset anchor point according to the image to be detected.
Step 602, inputting an image to be detected into a target lightweight basic model to obtain a basic feature map corresponding to a preset anchor point.
The target lightweight base model is a lightweight base model included in the target detection model provided in the above embodiment. Referring to fig. 2, the basic feature map includes, for example, a basic feature map 1 output by the convolutional layer 5 of the target lightweight base model, a basic feature map 2 output by the convolutional layer 11, and a basic feature map 3 output by the convolutional layer 13.
In connection with the illustration in the above embodiment, in the case where the preset anchor points include anchor1 to anchor6, basic feature map 1 corresponding to anchor1, anchor2, and anchor3 is obtained; basic feature map 2 corresponding to anchor4 and anchor5 is obtained; and basic feature map 3 corresponding to anchor6 is obtained.
Step 603, inputting the basic feature map into the target lightweight detection model to obtain a detection feature map corresponding to the basic feature map, obtaining a category confidence and a position corresponding to each rectangular frame according to the detection feature map, and obtaining a target detection result of a target object in the image to be detected according to the category confidence and the position corresponding to each rectangular frame.
The SSH detection module in detection module 1 can obtain detection feature map 1 corresponding to basic feature map 1; the SSH detection module in detection module 2 can obtain detection feature map 2 corresponding to basic feature map 2; and the SSH detection module in detection module 3 can obtain detection feature map 3 corresponding to basic feature map 3. The category confidences and positions corresponding to the rectangular frames of the first scale can be obtained according to detection feature map 1, where the rectangular frames of the first scale include the rectangular frames corresponding to the scales of anchor1, anchor2, and anchor3; the category confidences and positions corresponding to the rectangular frames of the second scale can be obtained according to detection feature map 2, where the rectangular frames of the second scale include the rectangular frames corresponding to the scales of anchor4 and anchor5; and the category confidences and positions corresponding to the rectangular frames of the third scale can be obtained according to detection feature map 3, where the rectangular frames of the third scale include the rectangular frame corresponding to the scale of anchor6. Using detection feature map 1, detection feature map 2, and detection feature map 3, the category confidence and position corresponding to each rectangular frame can be obtained.
A target detection result of the target object in the image to be detected is obtained according to the category confidence and position corresponding to each rectangular frame. A non-maximum suppression (NMS) algorithm may be used to obtain the target detection result, which includes the category confidence and position of the target object; the category confidence represents the probability that the target object belongs to its corresponding category. For example, for an image to be detected that includes a large-scale face, a medium-scale face, and a small-scale face, rectangular frames of the first, second, and third scales are obtained, the confidences of the rectangular frames of all scales are sorted, and the target detection result of the target object is determined according to the sorting result.
The NMS process is illustrated by sorting the confidences of all rectangular frames (in ascending or descending order). Take as an example a set H that includes the rectangular frames of the first scale, the second scale, and the third scale; the initialized set M is an empty set.
All rectangular frames of all scales in set H are sorted, and the rectangular frame A corresponding to the highest confidence is found and moved into set M; the current set H then contains the rectangular frames of all scales other than frame A. The IoU between each rectangular frame in the current set H and frame A is calculated; if the IoU between a given frame and frame A is greater than or equal to a preset threshold (for example, 0.5), that frame is considered to overlap frame A and is removed from set H. The remaining frames in set H are then sorted again, the frame with the highest confidence is found, and the process iterates until set H is empty. The category confidences and positions corresponding to the rectangular frames in set M are the finally obtained target detection result of the target object in the image to be detected.
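A sketch of the greedy NMS procedure just described, with the kept indices playing the role of set M and the 0.5 overlap threshold taken from the example; the box layout and names are assumptions.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression over frames of all scales.
    `boxes` is (N, 4) as (x1, y1, x2, y2); returns kept indices (set M)."""
    order = np.argsort(scores)[::-1]   # sort by confidence, descending
    keep = []
    while order.size > 0:
        a = order[0]                   # highest-confidence frame A
        keep.append(int(a))
        rest = order[1:]
        # IoU of frame A with every remaining frame in set H
        x1 = np.maximum(boxes[a, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[a, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[a, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[a, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_a = (boxes[a, 2] - boxes[a, 0]) * (boxes[a, 3] - boxes[a, 1])
        areas = ((boxes[rest, 2] - boxes[rest, 0]) *
                 (boxes[rest, 3] - boxes[rest, 1]))
        iou = inter / (area_a + areas - inter)
        order = rest[iou < iou_thresh]  # drop frames overlapping frame A
    return keep
```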
The target detection method provided by this embodiment obtains an image to be detected, generates a rectangular frame corresponding to a scale of a preset anchor point according to the image to be detected, inputs the image to be detected into a target lightweight basic model to obtain a basic feature map corresponding to the preset anchor point, inputs the basic feature map into the target lightweight detection model to obtain a detection feature map corresponding to the basic feature map, obtains a category confidence and a position corresponding to each rectangular frame according to the detection feature map, and obtains a target detection result of a target object in the image to be detected according to the category confidence and the position corresponding to each rectangular frame. Because the light-weight basic model is adopted, the related parameter calculation amount is small, and the detection speed of the target object can be improved to a certain extent. Meanwhile, the preset detection model is constructed according to the scale of the preset anchor point, so that the constructed preset detection model is suitable for detecting the target object with the size similar to that of the preset anchor point, and the target detection model is obtained by training the constructed preset detection model, so that the obtained target detection model is suitable for detecting the target object with the size similar to that of the preset anchor point, and the detection precision of the target object is ensured to a certain extent.
Optionally, the preset anchor points include at least one of a first preset anchor point (the first preset anchor point includes, for example, anchor1, anchor2 and anchor3), a second preset anchor point (the second preset anchor point includes, for example, anchor4 and anchor5), and a third preset anchor point (the third preset anchor point includes, for example, anchor6), and the size of the second preset anchor point is greater than that of the first preset anchor point and smaller than that of the third preset anchor point;
the basic characteristic diagram comprises: at least one of a first basic feature map (e.g., basic feature map 1) corresponding to a first preset anchor point, a second basic feature map (e.g., basic feature map 2) corresponding to a second preset anchor point, and a third basic feature map (e.g., basic feature map 3) corresponding to a third preset anchor point;
the detection characteristic diagram comprises the following steps: at least one of a first target detection feature map corresponding to the first basic feature map (the first target detection feature map is, for example, a feature map output by an SSH detection module of the detection module 1), a second target detection feature map corresponding to the second basic feature map (the second target detection feature map is, for example, a feature map output by an SSH detection module of the detection module 2), and a third target detection feature map corresponding to the third basic feature map (the third target detection feature map is, for example, a feature map output by an SSH detection module of the detection module 3), where a receptive field of the first target detection feature map is greater than a maximum resolution of the first preset anchor point, a receptive field of the second target detection feature map is greater than a maximum resolution of the second preset anchor point, and a receptive field of the third target detection feature map is greater than a maximum resolution of the third preset anchor point;
the category confidence and the position corresponding to all the rectangular boxes comprise: a category confidence and position corresponding to at least one of the category confidence and position corresponding to the first rectangular box, the category confidence and position corresponding to the second rectangular box, and the category confidence and position corresponding to the third rectangular box; the first rectangular frame is a rectangular frame corresponding to the first preset anchor point, the second rectangular frame is a rectangular frame corresponding to the second preset anchor point, and the third rectangular frame is a rectangular frame corresponding to the third preset anchor point.
Optionally, the first basic feature map is obtained by a first preset number of first convolution layers in the target lightweight basic model according to the image to be detected, the second basic feature map is obtained by a second preset number of second convolution layers in the target lightweight basic model according to the first basic feature map, and the third basic feature map is obtained by a third preset number of third convolution layers in the target lightweight basic model according to the second basic feature map.
The first preset number of first convolutional layers is, for example, the 5 convolutional layers from convolutional layer 1 to convolutional layer 5 in fig. 2; the second preset number of second convolutional layers is, for example, the 6 convolutional layers from convolutional layer 6 to convolutional layer 11 in fig. 2; and the third preset number of third convolutional layers is, for example, the 2 convolutional layers from convolutional layer 12 to convolutional layer 13 in fig. 2.
Optionally, the first target detection feature map is obtained by performing convolution operation according to the first basic feature map through a first detection module in the target lightweight detection model, or the first target detection feature map is obtained by fusing a second target detection feature map or a third target detection feature map with the first detection feature map;
the second target detection feature map is obtained by performing convolution operation on a second detection module in the target lightweight detection model according to a second basic feature map, or is obtained by fusing the second detection feature map and a third target detection feature map.
Among them, the SSH detection module in detection module 1 (the first detection module) outputs the first detection feature map (detection feature map 1), the SSH detection module in detection module 2 (the second detection module) outputs the second detection feature map (detection feature map 2), and the SSH detection module in detection module 3 (the third detection module) outputs the third detection feature map (detection feature map 3). To further improve the detection accuracy of intermediate-scale targets, detection feature map 3 and detection feature map 2 may be fused, and the fused detection feature map is used as the second target detection feature map. The second target detection feature map is input to the regression module 21 and the classification module 22, which calculate the category confidences and positions corresponding to the rectangular frames of the second scale based on it; because the fused detection feature map contains richer semantic information, the detection accuracy of intermediate-scale target objects can be further improved.
To improve the detection accuracy of smaller-scale target objects, the fused detection feature map may in turn be fused with detection feature map 1 to obtain a further fused detection feature map, which again contains richer semantic information and is used as the first target detection feature map. The first target detection feature map is input to the regression module 11 and the classification module 12, which calculate the category confidences and positions corresponding to the rectangular frames of the first scale based on it, so the detection accuracy of smaller-scale target objects can be further improved. Detection feature map 2 or detection feature map 3 can also be fused directly with detection feature map 1 to improve the detection accuracy of small target objects.
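One plausible way to realize this fusion, sketched in PyTorch under the assumption that the two detection feature maps have the same number of channels; the embodiment does not fix the fusion operator, and element-wise addition after bilinear upsampling is one common choice.

```python
import torch.nn.functional as F

def fuse(deeper_map, shallower_map):
    """Upsample the deeper (spatially smaller) detection feature map to
    the shallower map's resolution, then add element-wise, so the fused
    map carries the deeper map's richer semantics at the shallower scale."""
    upsampled = F.interpolate(deeper_map, size=shallower_map.shape[2:],
                              mode="bilinear", align_corners=False)
    return shallower_map + upsampled

# e.g. second target detection feature map: fuse(det_map_3, det_map_2)
#      first target detection feature map:  fuse(fuse(det_map_3, det_map_2), det_map_1)
```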
Referring to fig. 7, fig. 7 is a schematic structural diagram of a model training apparatus provided in an embodiment of the present invention, where the apparatus 700 includes:
a first obtaining module 710 for obtaining a target image sample; wherein the target image sample comprises an image sample of at least one resolution;
a training module 720, configured to train a preset detection model with the target image sample to obtain a target detection model; the preset detection model is built according to the scale of a preset anchor point, the preset detection model comprises a light-weight basic model and a light-weight detection model connected with the light-weight basic model, the target detection model comprises a target light-weight basic model and a target light-weight detection model, the target light-weight basic model is used for obtaining a feature map of the target image sample, and the target light-weight detection model is used for obtaining a target detection result of a target object in the target image sample according to the feature map.
Optionally, the lightweight base model includes a model composed of a first preset number of convolutional layers in a mobile network model; the lightweight detection model includes the detection part, other than the base network, of a single-stage headless (SSH) model, together with a second preset number of convolutional layers; an Nth convolutional layer among the first preset number of convolutional layers is connected to the second preset number of convolutional layers; the second preset number of convolutional layers is connected to the detection module, in the lightweight detection model, used for detecting a target object of a preset scale; and N is smaller than the first preset number.
Optionally, the training module 720 is specifically configured to train the preset detection model by using a first target image sample to obtain a first detection model when the target image sample includes image samples of multiple resolutions, where the first target image sample is an image sample of any one resolution in the target image samples;
training the first detection model by adopting a second target image sample to obtain a second detection model, wherein the second target image sample is an image sample with any resolution in image samples except the first target image sample in the target image sample;
taking the second detection model as the target detection model in a case where there is no image sample of a resolution other than the first target image sample and the second target image sample in the target image samples;
and under the condition that the image samples with other resolutions exist in the target image sample, obtaining the target detection model according to the second detection model and the image samples with other resolutions.
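As a hedged illustration of this multi-resolution training schedule, the sketch below uses hypothetical names, with `train_fn` standing in for whatever single-resolution training routine is used.

```python
def train_multi_resolution(model, samples_by_resolution, train_fn):
    """Sequentially train across resolutions: the model trained on one
    resolution (the first detection model) becomes the starting point
    for the next (yielding the second detection model, and so on); the
    model produced by the last resolution is the target detection model."""
    for samples in samples_by_resolution:
        model = train_fn(model, samples)
    return model
```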
Optionally, referring to fig. 8, fig. 8 is a schematic structural diagram of another model training device provided in an embodiment of the present invention, where the device 800 may further include:
a second obtaining module 810, configured to obtain a target classification model, where the target classification model is obtained by training a classification model using a preset image sample;
an initializing module 820, configured to initialize the lightweight base model according to the target classification model to obtain an initialized lightweight base model;
the training module 720 is specifically configured to train a preset detection model including the initialized lightweight basic model by using the target image sample to obtain the target detection model.
Optionally, the first obtaining module 710 is specifically configured to obtain the original image;
determining a central point in a target area under the condition that the original image needs to be cut, and determining a cut image by taking the central point as a center; the width of the cut image is equal to any multiple within a preset multiple range of the width of the original image, the height of the cut image is equal to any multiple within the preset multiple range of the height of the original image, the target area is the remaining area of the first original target image except for the boundary area, and the boundary area is an area which is away from the peripheral boundary of the first original target image and has a preset size;
and under the condition that the color conversion needs to be carried out on the cutting image, carrying out the color conversion on the cutting image, and judging whether the size of the cutting image after the color conversion is adjusted or not according to the size of the cutting image after the color conversion so as to obtain the target image sample.
Optionally, the first obtaining module 710 is further configured to determine whether color conversion is required to be performed on the original image under the condition that the original image does not need to be cropped; and if the original image needs to be subjected to color conversion, performing color conversion on the original image, and judging whether the size of the original image subjected to color conversion is adjusted or not according to the size of the original image subjected to color conversion so as to obtain any one of the first image sample, the second image sample and the third image sample.
Optionally, the first obtaining module 710 is further configured to, in a case that color conversion is not required to be performed on the cropped image, determine whether to adjust the size of the cropped image after the color conversion according to the size of the cropped image, so as to obtain any one of the first image sample, the second image sample, and the third image sample.
Referring to fig. 9, fig. 9 is a schematic structural diagram of an object detection apparatus provided in an embodiment of the present invention, where the apparatus 900 includes:
a first obtaining module 910, configured to obtain an image to be detected, and generate a rectangular frame corresponding to a scale of a preset anchor point according to the image to be detected;
a second obtaining module 920, configured to input the image to be detected into a target lightweight base model to obtain a basic feature map corresponding to the preset anchor point;
a third obtaining module 930, configured to input the basic feature map into the target lightweight detection model to obtain a detection feature map corresponding to the basic feature map, obtain a category confidence and a position corresponding to each rectangular frame according to the detection feature map, and obtain a target detection result of the target object in the image to be detected according to the category confidence and the position corresponding to each rectangular frame.
Optionally, the preset anchor point includes at least one of a first preset anchor point, a second preset anchor point and a third preset anchor point, and a scale of the second preset anchor point is larger than a scale of the first preset anchor point and smaller than a scale of the third preset anchor point;
the basic feature map includes: at least one of a first basic feature map corresponding to the first preset anchor point, a second basic feature map corresponding to the second preset anchor point, and a third basic feature map corresponding to the third preset anchor point;
the detection characteristic diagram comprises: at least one of a first target detection feature map corresponding to the first basic feature map, a second target detection feature map corresponding to the second basic feature map, and a third target detection feature map corresponding to the third basic feature map, wherein a receptive field of the first target detection feature map is greater than a maximum resolution of the first preset anchor point, a receptive field of the second target detection feature map is greater than a maximum resolution of the second preset anchor point, and a receptive field of the third target detection feature map is greater than a maximum resolution of the third preset anchor point;
the category confidence and the position corresponding to all the rectangular boxes comprise: a category confidence and position corresponding to at least one of the category confidence and position corresponding to the first rectangular box, the category confidence and position corresponding to the second rectangular box, and the category confidence and position corresponding to the third rectangular box; the first rectangular frame is a rectangular frame corresponding to the first preset anchor point, the second rectangular frame is a rectangular frame corresponding to the second preset anchor point, and the third rectangular frame is a rectangular frame corresponding to the third preset anchor point.
Optionally, the first basic feature map is obtained by a first preset number of first convolutional layers in the target lightweight basic model according to the image to be detected, the second basic feature map is obtained by a second preset number of second convolutional layers in the target lightweight basic model according to the first basic feature map, and the third basic feature map is obtained by a third preset number of third convolutional layers in the target lightweight basic model according to the second basic feature map.
Optionally, the first target detection feature map is a first detection feature map obtained by performing convolution operation on the first basic feature map through a first detection module in the target lightweight detection model, or the first target detection feature map is a feature map obtained by fusing the first detection feature map with the second target detection feature map or the third target detection feature map;
the second target detection feature map is obtained by performing convolution operation on a second detection module in the target lightweight detection model according to the second basic feature map, or is obtained by fusing the second detection feature map and the third target detection feature map;
and the third target detection feature map is a feature map obtained by performing convolution operation on a third detection module in the target lightweight detection model according to the third basic feature map.
An embodiment of the present invention further provides an electronic device, as shown in fig. 10; fig. 10 is a schematic structural diagram of the electronic device provided in the embodiment of the present invention. The electronic device comprises a processor 1001, a communication interface 1002, a memory 1003, and a communication bus 1004, where the processor 1001, the communication interface 1002, and the memory 1003 communicate with each other through the communication bus 1004.
a memory 1003 for storing a computer program;
the processor 1001 is configured to implement the following steps when executing the program stored in the memory 1003:
obtaining a target image sample; wherein the target image sample comprises an image sample of at least one resolution;
training a preset detection model by using the target image sample to obtain a target detection model; the preset detection model is built according to the scale of a preset anchor point, the preset detection model comprises a light-weight basic model and a light-weight detection model connected with the light-weight basic model, the target detection model comprises a target light-weight basic model and a target light-weight detection model, the target light-weight basic model is used for obtaining a feature map of the target image sample, and the target light-weight detection model is used for obtaining a target detection result of a target object in the target image sample according to the feature map. Alternatively:
acquiring an image to be detected, and generating a rectangular frame corresponding to the scale of a preset anchor point according to the image to be detected; inputting the image to be detected into the target lightweight basic model to obtain a basic characteristic diagram corresponding to the preset anchor point; and inputting the basic feature map into the target lightweight detection model to obtain a detection feature map corresponding to the basic feature map, obtaining a category confidence coefficient and a position corresponding to each rectangular frame according to the detection feature map, and obtaining a target detection result of a target object in the image to be detected according to the category confidence coefficient and the position corresponding to each rectangular frame.
The communication bus mentioned in the above terminal may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the terminal and other equipment.
The memory may include a random access memory (RAM), or may include a non-volatile memory, such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
In yet another embodiment of the present invention, a computer-readable storage medium is further provided, which has instructions stored therein, and when the instructions are executed on a computer, the instructions cause the computer to perform any one of the model training methods or any one of the object detection methods described in the above embodiments.
In yet another embodiment, a computer program product comprising instructions is provided, which when run on a computer, causes the computer to perform any of the model training methods described in the previous embodiments or any of the object detection methods described in the previous embodiments.
In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium accessible by a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (15)

1. A method of model training, comprising:
obtaining a target image sample; wherein the target image sample comprises an image sample of at least one resolution;
training a preset detection model by using the target image sample to obtain a target detection model; the preset detection model is built according to the scale of a preset anchor point, the preset detection model comprises a light-weight basic model and a light-weight detection model connected with the light-weight basic model, the target detection model comprises a target light-weight basic model and a target light-weight detection model, the target light-weight basic model is used for obtaining a feature map of the target image sample, and the target light-weight detection model is used for obtaining a target detection result of a target object in the target image sample according to the feature map.
2. The method of claim 1, wherein the lightweight base model comprises a model composed of a first preset number of convolutional layers in a mobile network model; the lightweight detection model comprises the detection part, other than the base network, of a single-stage headless (SSH) model, together with a second preset number of convolutional layers; an Nth convolutional layer among the first preset number of convolutional layers is connected with the second preset number of convolutional layers; the second preset number of convolutional layers is connected with a detection module, in the lightweight detection model, for detecting a target object of a preset scale; and N is smaller than the first preset number.
3. The method of claim 1, wherein the training of a preset detection model with the target image sample to obtain a target detection model comprises:
under the condition that the target image samples comprise image samples with various resolutions, training the preset detection model by adopting a first target image sample to obtain a first detection model, wherein the first target image sample is an image sample with any one resolution in the target image samples;
training the first detection model by adopting a second target image sample to obtain a second detection model, wherein the second target image sample is an image sample with any resolution in image samples except the first target image sample in the target image sample;
taking the second detection model as the target detection model in a case where there is no image sample of a resolution other than the first target image sample and the second target image sample in the target image samples;
and under the condition that the image samples with other resolutions exist in the target image sample, obtaining the target detection model according to the second detection model and the image samples with other resolutions.
4. The method of claim 1, before the training of the preset detection model with the target image sample to obtain the target detection model, further comprising:
obtaining a target classification model, wherein the target classification model is obtained by training a classification model by adopting a preset image sample;
initializing the lightweight basic model according to the target classification model to obtain an initialized lightweight basic model;
the training of the preset detection model by adopting the target image sample to obtain the target detection model comprises the following steps:
and training a preset detection model comprising the initialized lightweight basic model by adopting the target image sample to obtain the target detection model.
5. The method of claim 1, wherein obtaining the target image sample comprises:
obtaining the original image;
determining a central point in a target area under the condition that the original image needs to be cut, and determining a cut image by taking the central point as a center; the width of the cut image is equal to any multiple within a preset multiple range of the width of the original image, the height of the cut image is equal to any multiple within the preset multiple range of the height of the original image, the target area is the remaining area of the first original target image except for the boundary area, and the boundary area is an area which is away from the peripheral boundary of the first original target image and has a preset size;
and under the condition that the color conversion needs to be carried out on the cutting image, carrying out the color conversion on the cutting image, and judging whether the size of the cutting image after the color conversion is adjusted or not according to the size of the cutting image after the color conversion so as to obtain the target image sample.
6. The method of claim 5, further comprising:
under the condition that the original image does not need to be cut, judging whether the color conversion of the original image is needed or not; and if the original image needs to be subjected to color conversion, performing color conversion on the original image, and judging whether the size of the original image subjected to color conversion is adjusted or not according to the size of the original image subjected to color conversion so as to obtain the target image sample.
7. The method of claim 5, further comprising:
and under the condition that the color conversion of the cut image is not needed, judging whether the size of the cut image is adjusted or not according to the size of the cut image so as to obtain the target image sample.
8. A method of object detection, comprising:
acquiring an image to be detected, and generating a rectangular frame corresponding to the scale of a preset anchor point according to the image to be detected;
inputting the image to be detected into the target lightweight basic model of any one of claims 1 to 7 to obtain a basic feature map corresponding to the preset anchor point;
and inputting the basic feature map into the target lightweight detection model to obtain a detection feature map corresponding to the basic feature map, obtaining a category confidence coefficient and a position corresponding to each rectangular frame according to the detection feature map, and obtaining a target detection result of a target object in the image to be detected according to the category confidence coefficient and the position corresponding to each rectangular frame.
9. The method of claim 8, wherein the pre-set anchor comprises at least one of a first pre-set anchor, a second pre-set anchor, and a third pre-set anchor, wherein the second pre-set anchor has a dimension that is greater than a dimension of the first pre-set anchor and less than a dimension of the third pre-set anchor;
the basic feature map includes: at least one of a first basic feature map corresponding to the first preset anchor point, a second basic feature map corresponding to the second preset anchor point, and a third basic feature map corresponding to the third preset anchor point;
the detection characteristic diagram comprises: at least one of a first target detection feature map corresponding to the first basic feature map, a second target detection feature map corresponding to the second basic feature map, and a third target detection feature map corresponding to the third basic feature map, wherein a receptive field of the first target detection feature map is greater than a maximum resolution of the first preset anchor point, a receptive field of the second target detection feature map is greater than a maximum resolution of the second preset anchor point, and a receptive field of the third target detection feature map is greater than a maximum resolution of the third preset anchor point;
the category confidence and the position corresponding to all the rectangular boxes comprise: a category confidence and position corresponding to at least one of the category confidence and position corresponding to the first rectangular box, the category confidence and position corresponding to the second rectangular box, and the category confidence and position corresponding to the third rectangular box; the first rectangular frame is a rectangular frame corresponding to the first preset anchor point, the second rectangular frame is a rectangular frame corresponding to the second preset anchor point, and the third rectangular frame is a rectangular frame corresponding to the third preset anchor point.
10. The method according to claim 9, wherein the first basic feature map is obtained by a first preset number of first convolutional layers in the target lightweight basic model according to the image to be detected, the second basic feature map is obtained by a second preset number of second convolutional layers in the target lightweight basic model according to the first basic feature map, and the third basic feature map is obtained by a third preset number of third convolutional layers in the target lightweight basic model according to the second basic feature map.
11. The method according to claim 9, wherein the first target detection feature map is a first detection feature map obtained by performing a convolution operation on the first basic feature map through a first detection module in the target lightweight detection model, or the first target detection feature map is a feature map obtained by fusing the first detection feature map and the second target detection feature map or the third target detection feature map;
the second target detection feature map is obtained by performing convolution operation on a second detection module in the target lightweight detection model according to the second basic feature map, or is obtained by fusing the second detection feature map and the third target detection feature map;
and the third target detection feature map is a feature map obtained by performing convolution operation on a third detection module in the target lightweight detection model according to the third basic feature map.
12. A model training apparatus, comprising:
a first obtaining module for obtaining a target image sample; wherein the target image sample comprises an image sample of at least one resolution;
the training module is used for training a preset detection model by adopting the target image sample to obtain a target detection model; the preset detection model is built according to the scale of a preset anchor point, the preset detection model comprises a light-weight basic model and a light-weight detection model connected with the light-weight basic model, the target detection model comprises a target light-weight basic model and a target light-weight detection model, the target light-weight basic model is used for obtaining a feature map of the target image sample, and the target light-weight detection model is used for obtaining a target detection result of a target object in the target image sample according to the feature map.
13. An object detection device, comprising:
the first obtaining module is used for obtaining an image to be detected and generating a rectangular frame corresponding to the scale of a preset anchor point according to the image to be detected;
a second obtaining module, configured to input the image to be detected into the target lightweight basis model according to any one of claims 1 to 7, so as to obtain a basic feature map corresponding to the preset anchor point;
and the third obtaining module is used for inputting the basic feature map into the target lightweight detection model to obtain a detection feature map corresponding to the basic feature map, obtaining the category confidence coefficient and the position corresponding to each rectangular frame according to the detection feature map, and obtaining a target detection result of the target object in the image to be detected according to the category confidence coefficient and the position corresponding to each rectangular frame.
14. An electronic device, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
a processor for performing the method steps of any of claims 1 to 7 or for performing the method steps of any of claims 8 to 11 when executing a program stored in a memory.
15. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7 or carries out the method according to any one of claims 8 to 11.
CN202010556608.8A 2020-06-17 2020-06-17 Model training method, target detection method, device, electronic equipment and readable storage medium Pending CN111738133A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010556608.8A CN111738133A (en) 2020-06-17 2020-06-17 Model training method, target detection method, device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010556608.8A CN111738133A (en) 2020-06-17 2020-06-17 Model training method, target detection method, device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN111738133A true CN111738133A (en) 2020-10-02

Family

ID=72649604

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010556608.8A Pending CN111738133A (en) 2020-06-17 2020-06-17 Model training method, target detection method, device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN111738133A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108460787A (en) * 2018-03-06 2018-08-28 北京市商汤科技开发有限公司 Method for tracking target and device, electronic equipment, program, storage medium
CN108520229A (en) * 2018-04-04 2018-09-11 北京旷视科技有限公司 Image detecting method, device, electronic equipment and computer-readable medium
US20190347828A1 (en) * 2018-05-09 2019-11-14 Beijing Kuangshi Technology Co., Ltd. Target detection method, system, and non-volatile storage medium
CN109359555A (en) * 2018-09-21 2019-02-19 江苏安凰领御科技有限公司 A kind of high-precision human face quick detection method
WO2020107510A1 (en) * 2018-11-27 2020-06-04 Beijing Didi Infinity Technology And Development Co., Ltd. Ai systems and methods for objection detection
CN109886282A (en) * 2019-02-26 2019-06-14 腾讯科技(深圳)有限公司 Method for checking object, device, computer readable storage medium and computer equipment
CN110991403A (en) * 2019-12-19 2020-04-10 同方知网(北京)技术有限公司 Document information fragmentation extraction method based on visual deep learning
CN111260630A (en) * 2020-01-16 2020-06-09 高新兴科技集团股份有限公司 Improved lightweight small target detection method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MAHYAR NAJIBI ET AL: "SSH: Single Stage Headless Face Detector", arXiv:1708.03979v3 *
YAQIYU: "Anchor in Object Detection" (目标检测中的Anchor), https://zhuanlan.zhihu.com/p/55824651 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112215171A (en) * 2020-10-15 2021-01-12 腾讯科技(深圳)有限公司 Target detection method, device, equipment and computer readable storage medium
CN112215171B (en) * 2020-10-15 2024-01-26 腾讯科技(深圳)有限公司 Target detection method, device, equipment and computer readable storage medium
CN112307978A (en) * 2020-10-30 2021-02-02 腾讯科技(深圳)有限公司 Target detection method and device, electronic equipment and readable storage medium
CN112307978B (en) * 2020-10-30 2022-05-24 腾讯科技(深圳)有限公司 Target detection method and device, electronic equipment and readable storage medium
CN112906611A (en) * 2021-03-05 2021-06-04 新疆爱华盈通信息技术有限公司 Manhole cover detection method and device, electronic equipment and storage medium
CN112906611B (en) * 2021-03-05 2024-04-26 新疆爱华盈通信息技术有限公司 Manhole cover detection method and device, electronic equipment and storage medium
CN113378834A (en) * 2021-06-28 2021-09-10 北京百度网讯科技有限公司 Object detection method, device, apparatus, storage medium, and program product
CN113378834B (en) * 2021-06-28 2023-08-04 北京百度网讯科技有限公司 Object detection method, device, apparatus, storage medium, and program product
CN113642549A (en) * 2021-10-18 2021-11-12 中航信移动科技有限公司 Rapid target detection system and electronic equipment

Similar Documents

Publication Publication Date Title
CN111738133A (en) Model training method, target detection method, device, electronic equipment and readable storage medium
CN110348294B (en) Method and device for locating charts in PDF documents, and computer equipment
CN108875537B (en) Object detection method, device and system, and storage medium
CN111104962A (en) Semantic segmentation method and device for image, electronic equipment and readable storage medium
CN109886335B (en) Classification model training method and device
CN112801146B (en) Target detection method and system
CN110992238A (en) Digital image tampering blind detection method based on dual-channel network
CN113822951B (en) Image processing method, device, electronic equipment and storage medium
CN110633610A (en) Student state detection algorithm based on YOLO
CN109816659B (en) Image segmentation method, device and system
CN113704531A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN108875903B (en) Image detection method, device, system and computer storage medium
CN111368636A (en) Object classification method and device, computer equipment and storage medium
CN112634201B (en) Target detection method and device and electronic equipment
CN111274981A (en) Target detection network construction method and device and target detection method
CN110599455A (en) Display screen defect detection network model, method and device, electronic equipment and storage medium
CN113516697B (en) Image registration method, device, electronic equipment and computer readable storage medium
CN112926565B (en) Picture text recognition method, system, equipment and storage medium
CN111753729B (en) Fake face detection method and device, electronic equipment and storage medium
TW202127312A (en) Image processing method and computer readable medium thereof
CN111062385A (en) Network model construction method and system for image text information detection
CN113807407B (en) Target detection model training method, model performance detection method and device
CN115631374A (en) Control operation method, control detection model training method, device and equipment
KR102509343B1 (en) Method and system for analyzing layout of image
CN110969602B (en) Image definition detection method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination