CN114882321A

CN114882321A - Deep learning model training method, target object detection method and device

Info

Publication number: CN114882321A
Application number: CN202210611399.1A
Authority: CN
Inventors: 陈子亮
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2022-05-30
Filing date: 2022-05-30
Publication date: 2022-08-09

Abstract

The present disclosure provides a training method and apparatus for a deep learning model, a target object detection method and apparatus, an electronic device, a storage medium, and a computer program product. It relates to the field of artificial intelligence, in particular to the technical fields of deep learning, image processing, and computer vision, and can be applied to scenarios such as object detection and object recognition. The specific implementation scheme is as follows: determining, from a sample image, a sample category and a sample bounding box of a target object in the sample image, wherein the sample image includes a label of the target object; determining a classification loss value and a first regression loss value according to the sample category, the sample bounding box, and the label; correcting the first regression loss value by using an adjustment factor to obtain a second regression loss value, wherein the adjustment factor indicates the regression difficulty of the sample image; and adjusting parameters of the deep learning model according to the classification loss value and the second regression loss value.

Description

Deep learning model training method, target object detection method and device

Technical Field

The present disclosure relates to the field of artificial intelligence technology, and more particularly to the fields of computer vision, deep learning, and image processing. In particular, it relates to a training method and apparatus for a deep learning model, a target object detection method and apparatus, an electronic device, a storage medium, and a computer program product.

Background

Hard sample mining is one of the research directions in target detection. Hard samples are samples that the model cannot classify correctly, or can classify only with difficulty, during training. If hard samples can be mined, the imbalance between hard and easy samples can be effectively mitigated, so that the model learns better feature representations and the accuracy of the model output is improved.

Disclosure of Invention

The present disclosure provides a training method and apparatus for a deep learning model, a target object detection method and apparatus, an electronic device, a storage medium, and a computer program product.

According to one aspect of the present disclosure, there is provided a training method of a deep learning model, including: determining, from a sample image, a sample category and a sample bounding box of a target object in the sample image, wherein the sample image includes a label of the target object; determining a classification loss value and a first regression loss value according to the sample category, the sample bounding box, and the label; correcting the first regression loss value by using an adjustment factor to obtain a second regression loss value, wherein the adjustment factor indicates the regression difficulty of the sample image; and adjusting parameters of the deep learning model according to the classification loss value and the second regression loss value.

According to another aspect of the present disclosure, there is provided a target object detection method including: inputting an image to be detected into a deep learning model to obtain category information and positioning information of a target object in the image to be detected, wherein the deep learning model is trained by using the above training method of the deep learning model.

According to another aspect of the present disclosure, there is provided a training apparatus for a deep learning model, including: a first determining module configured to determine, from a sample image, a sample category and a sample bounding box of a target object in the sample image, wherein the sample image includes a label of the target object; a calculation module configured to determine a classification loss value and a first regression loss value according to the sample category, the sample bounding box, and the label; a correction module configured to correct the first regression loss value by using an adjustment factor to obtain a second regression loss value, wherein the adjustment factor indicates the regression difficulty of the sample image; and an adjusting module configured to adjust parameters of the deep learning model according to the classification loss value and the second regression loss value.

According to another aspect of the present disclosure, there is provided a target object detection apparatus including: a detection module configured to input an image to be detected into a deep learning model to obtain category information and positioning information of a target object in the image to be detected, wherein the deep learning model is trained by using the above training apparatus for the deep learning model.

According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method provided in accordance with the present disclosure.

According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform a method provided according to the present disclosure.

According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements a method provided according to the present disclosure.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:

FIG. 1 is a schematic diagram of an exemplary system architecture to which a training method of a deep learning model and a target object detection method and apparatus may be applied, according to an embodiment of the present disclosure;

FIG. 2 is a flow diagram of a method of training a deep learning model according to an embodiment of the present disclosure;

FIG. 3 is a schematic diagram of a training method of a deep learning model according to an embodiment of the present disclosure;

FIG. 4 is a flow chart diagram of a target object detection method according to an embodiment of the present disclosure;

FIG. 5 is a block diagram of a training apparatus for deep learning models, according to an embodiment of the present disclosure;

FIG. 6 is a block diagram of a target object detection apparatus according to an embodiment of the present disclosure; and

FIG. 7 is a block diagram of an electronic device for a training method of a deep learning model and a target object detection method according to an embodiment of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

In computer vision and related technical fields, target detection is a core task. A target detection task may include a classification branch for target recognition and a regression branch for target localization.

With the development of computer vision and related fields, hard sample mining in target detection has gradually attracted attention, and hard sample mining strategies have been proposed to address the imbalance between hard and easy samples. However, these strategies focus mainly on the classification branch of target detection, while the regression branch lacks a hard sample mining strategy, so the impact of the regression branch on detection box accuracy is neglected.

Fig. 1 is a schematic diagram of an exemplary system architecture to which a training method of a deep learning model and a target object detection method and apparatus may be applied, according to one embodiment of the present disclosure. It should be noted that fig. 1 is only an example of a system architecture to which the embodiments of the present disclosure may be applied, intended to help those skilled in the art understand the technical content of the present disclosure. It does not mean that the embodiments of the present disclosure cannot be applied to other devices, systems, environments, or scenarios.

As shown in fig. 1, the system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired and/or wireless communication links.

The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 101, 102, 103, such as knowledge reading applications, web browser applications, search applications, instant messaging tools, mailbox clients, or social platform software (examples only).

The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop computers, desktop computers, and the like.

The server 105 may be a server providing various services, such as a background management server (example only) that supports websites browsed by users using the terminal devices 101, 102, 103. The background management server may analyze and process received data such as user requests, and feed back a processing result (for example, a web page, information, or data obtained or generated according to the user request) to the terminal device.

It should be noted that the training method of the deep learning model provided by the embodiment of the present disclosure may be generally performed by the server 105. Accordingly, the training device for the deep learning model provided by the embodiment of the present disclosure may be generally disposed in the server 105.

Alternatively, the training method of the deep learning model provided by the embodiment of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the training device for the deep learning model provided by the embodiment of the present disclosure may also be disposed in a server or a server cluster that is different from the server 105 and capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.

It should be noted that the target object detection method provided by the embodiment of the present disclosure may be generally executed by the server 105. Accordingly, the target object detection apparatus provided by the embodiments of the present disclosure may be generally disposed in the server 105. The target object detection method provided by the embodiment of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the target object detection apparatus provided in the embodiments of the present disclosure may also be disposed in a server or a server cluster that is different from the server 105 and capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.

Alternatively, the target object detection method provided by the embodiment of the present disclosure may also be executed by the terminal device 101, 102, or 103. Accordingly, the target object detection apparatus provided in the embodiment of the present disclosure may also be disposed in the terminal device 101, 102, or 103.

It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

It should be noted that the sequence numbers of the respective operations in the following methods are merely used as representations of the operations for description, and should not be construed as representing the execution order of the respective operations. The method need not be performed in the exact order shown, unless explicitly stated.

Fig. 2 is a flow diagram of a method of training a deep learning model according to an embodiment of the present disclosure.

As shown in FIG. 2, the training method 200 of the deep learning model may include operations S210-S240.

In operation S210, a sample category and a sample bounding box of a target object in a sample image are determined according to the sample image.

In operation S220, a classification loss value and a first regression loss value are determined according to the sample category, the sample bounding box, and the label.

In operation S230, the first regression loss value is corrected by using the adjustment factor to obtain a second regression loss value.

In operation S240, parameters of the deep learning model are adjusted according to the classification loss value and the second regression loss value.

According to an embodiment of the present disclosure, the sample image may be any one or more frames in a video stream captured by a camera, or may be acquired in other ways, which is not limited in the present disclosure. The sample image may include one or more target objects and a label for each target object. The target object may be any of various objects, such as a face, and is not specifically limited. The label of the target object indicates the category information and the positioning information of the target object in the sample image.

According to the embodiment of the disclosure, the sample image is input into the deep learning model to obtain the sample category and the sample bounding box of the target object in the sample image. It is to be understood that the deep learning model may be any deep learning model for target detection, including but not limited to the YOLO (You Only Look Once) series, the R-CNN (Regions with CNN features) series, an SSD (Single Shot MultiBox Detector) model, a RetinaNet model, and the like, and may be selected according to the actual application scenario.

According to an embodiment of the disclosure, the classification loss value characterizes a classification loss of the sample image, and the first regression loss value characterizes a regression loss of the sample image.

According to an embodiment of the disclosure, the first regression loss value may be corrected by, for example, multiplying it by the adjustment factor to obtain the second regression loss value. Since the adjustment factor indicates the regression difficulty of the sample image, this is equivalent to weighting the regression loss of each sample image according to its regression difficulty. During model training, hard samples usually account for a small share of the samples, while easy samples account for a large share. Correcting the first regression loss value with the adjustment factor gives a smaller weight to easy-to-regress samples and a larger weight to hard-to-regress samples, so that the model focuses more on hard-to-regress samples. Hard samples are thereby mined in the regression branch, and the accuracy of the regression branch's predictions is improved.
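The multiplicative correction described above can be sketched in a few lines. This is a minimal illustration, not the disclosed implementation: the function name and the numeric factors (0.5 for an easy sample, 2.0 for a hard one) are hypothetical values chosen only to show the effect of the weighting.

```python
def second_regression_loss(first_regression_loss: float,
                           adjustment_factor: float) -> float:
    """Scale the first regression loss by a per-sample adjustment factor.

    A factor below 1 down-weights an easy-to-regress sample; a factor
    above 1 up-weights a hard-to-regress sample.
    """
    if adjustment_factor < 0:
        raise ValueError("adjustment_factor must be non-negative")
    return adjustment_factor * first_regression_loss


# Two samples with the same raw loss but different (hypothetical) factors.
easy = second_regression_loss(0.5, 0.5)   # easy sample, smaller weight
hard = second_regression_loss(0.5, 2.0)   # hard sample, larger weight
print(easy, hard)
```

With equal raw losses, the hard sample contributes four times as much to the regression term as the easy one in this example.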

It should be noted that, in the present disclosure, a hard sample is a sample that the model cannot classify correctly, or can classify only with difficulty, during training. Accordingly, an easy sample is a sample that the model classifies correctly with ease during training. Whether the model can correctly classify a sample can be judged from the loss of that sample during training. For example, a hard positive sample is a positive sample that is misclassified as a negative sample, that is, a positive sample with the largest loss during training. A hard negative sample is a negative sample that is misclassified as a positive sample, that is, a negative sample with the largest loss during training. Likewise, an easy positive sample is a positive sample that is easily classified correctly, that is, a positive sample with the smallest loss during training. An easy negative sample is a negative sample that is easily classified correctly, that is, a negative sample with the smallest loss during training.
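Ranking samples by their training loss, as described above, can be sketched as follows. The helper name and the loss values are hypothetical; the sketch only shows how the largest-loss and smallest-loss samples could be picked out of one group (for example, all positive samples).

```python
from typing import List, Tuple


def hardest_and_easiest(losses: List[float], k: int = 1) -> Tuple[List[int], List[int]]:
    """Return indices of the k largest-loss (hard) and k smallest-loss (easy) samples."""
    if not 1 <= k <= len(losses):
        raise ValueError("k must be between 1 and the number of samples")
    by_loss = sorted(range(len(losses)), key=lambda i: losses[i])  # ascending loss
    easiest = by_loss[:k]
    hardest = by_loss[::-1][:k]
    return hardest, easiest


# Hypothetical per-sample losses for four positive samples.
hard_idx, easy_idx = hardest_and_easiest([0.10, 2.30, 0.70, 0.05], k=1)
print(hard_idx, easy_idx)
```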

According to an embodiment of the present disclosure, a parameter of the deep learning model may be adjusted according to the classification loss value and the second regression loss value. In the embodiment of the present disclosure, the deep learning model may be trained by using a plurality of batches of sample images until the model converges. The process of training the model using each sample image is the same as or similar to the process described above and will not be described in detail here.

In the scheme of the embodiment of the present disclosure, the regression loss value is corrected on the regression branch according to the regression difficulty of the sample image, so that a larger weight is given to hard-to-regress samples and a smaller weight to easy-to-regress samples. The model thus focuses more on hard-to-regress samples, which realizes hard sample mining in the regression branch. With this method, the model can concentrate its optimization on hard-to-regress samples without increasing the amount of computation, thereby improving the accuracy and detection efficiency of the regression branch's predictions.

According to an embodiment of the present disclosure, operation S210 may include the following operations.

A feature extraction operation is performed on the sample image to obtain a plurality of multi-scale feature maps. Multi-scale fusion processing is performed on the plurality of multi-scale feature maps to obtain a plurality of multi-scale fusion feature maps. The sample category and the sample bounding box of the target object in the sample image are then determined according to the plurality of multi-scale fusion feature maps.

According to an embodiment of the present disclosure, in one example, the plurality of multi-scale feature maps includes multi-scale feature maps at N scales, where N is an integer greater than 1. The multi-scale fusion processing of the plurality of multi-scale feature maps may, for example, proceed as follows. A convolution operation is performed on the multi-scale feature map at the maximum scale (for example, the Nth scale, denoted as the multi-scale feature map of the Nth scale) to obtain the multi-scale fusion feature map of the Nth scale. A deconvolution operation is performed on the multi-scale fusion feature map of the Nth scale to obtain an intermediate feature map of the (N-1)th scale, and this intermediate feature map is added to the multi-scale feature map of the (N-1)th scale to obtain the multi-scale fusion feature map of the (N-1)th scale. A deconvolution operation is then performed on the multi-scale fusion feature map of the (N-1)th scale to obtain an intermediate feature map of the (N-2)th scale, which is added to the multi-scale feature map of the (N-2)th scale to obtain the multi-scale fusion feature map of the (N-2)th scale. By analogy, a plurality of multi-scale fusion feature maps are obtained. The multi-scale feature maps correspond one-to-one to the multi-scale fusion feature maps, and each multi-scale feature map has the same size as its corresponding multi-scale fusion feature map.
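The top-down data flow of this fusion can be sketched with plain Python lists. This is a heavily simplified illustration under stated assumptions, not the disclosed network: each feature map is a single-channel 2D list, the Nth scale is assumed to be the lowest-resolution level with each level half the size of the previous one, the convolution is replaced by an identity copy, and the deconvolution is replaced by nearest-neighbor 2x upsampling. All function names are hypothetical.

```python
from typing import List

Map = List[List[float]]


def upsample2x(fmap: Map) -> Map:
    """Nearest-neighbor 2x upsampling (stand-in for a learned deconvolution)."""
    out: Map = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def add_maps(a: Map, b: Map) -> Map:
    """Element-wise sum of two equally sized feature maps."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("feature maps must have the same size")
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def fuse_top_down(feature_maps: List[Map]) -> List[Map]:
    """feature_maps[0] is scale 1 (largest map), feature_maps[-1] is scale N."""
    fused: List[Map] = [[] for _ in feature_maps]
    # Scale N: the convolution is replaced by an identity copy in this sketch.
    fused[-1] = [list(row) for row in feature_maps[-1]]
    # Scales N-1 .. 1: upsample the fused map above, then add the lateral map.
    for i in range(len(feature_maps) - 2, -1, -1):
        fused[i] = add_maps(upsample2x(fused[i + 1]), feature_maps[i])
    return fused


maps = [[[1.0] * 4 for _ in range(4)], [[10.0] * 2 for _ in range(2)]]
fused = fuse_top_down(maps)
print(len(fused[0]), len(fused[1]))
```

Each fused map keeps the size of the input map at the same scale, which matches the one-to-one correspondence stated above.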

In the embodiment of the disclosure, the sample category and the sample bounding box of the target object in the sample image are determined by using the plurality of multi-scale fusion feature maps, so that the model can accurately detect target objects at different scales, which improves the target detection capability of the model.

According to embodiments of the present disclosure, in one example, the label of the target object may include a category label and a position label. The category label and the position label indicate the category information and the positioning information of the target object in the sample image, respectively. Operation S220 may include the following operations.

Determining a classification loss value according to the sample category and the category label; and determining a first regression loss value according to the sample bounding box and the position label.

It will be appreciated that, in determining the classification loss, any classification loss function may be used to calculate the classification loss value from the sample category and the category label. Illustratively, the classification loss value may be calculated using a cross-entropy loss function, but the present disclosure is not limited thereto. Similarly, any regression loss function may be used to calculate the first regression loss value from the sample bounding box and the position label. For example, the first regression loss value may be calculated using a GIoU (Generalized Intersection over Union) loss function, a CIoU (Complete Intersection over Union) loss function, a DIoU (Distance Intersection over Union) loss function, a Smooth L1 loss function, or the like.
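As one concrete instance of the regression losses listed above, the following sketch computes a GIoU loss (1 minus GIoU) for a single pair of boxes. It assumes axis-aligned boxes in (x1, y1, x2, y2) corner format; the function name and the box format are illustrative assumptions, not taken from the disclosure.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def giou_loss(pred: Box, target: Box) -> float:
    """Return 1 - GIoU for two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = pred
    bx1, by1, bx2, by2 = target
    # Intersection area (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union if union > 0 else 0.0
    # Area of the smallest box enclosing both boxes.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    giou = iou - (enclose - union) / enclose if enclose > 0 else iou
    return 1.0 - giou


print(giou_loss((0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0)))  # identical boxes
print(giou_loss((0.0, 0.0, 1.0, 1.0), (2.0, 0.0, 3.0, 1.0)))  # disjoint boxes
```

Unlike a plain IoU loss, the GIoU loss still varies with the distance between the boxes when they do not overlap, which is why it exceeds 1 for the disjoint pair.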

According to an embodiment of the present disclosure, in one example, the following operations may be employed to determine the adjustment factor in operation S230.

The adjustment factor is determined according to the sample bounding box and the position label.

It will be appreciated that the position label indicates the positioning information of the target object in the sample image. Since the sample bounding box and the label indicating the category information and the positioning information of the target object in the sample image are already available after operation S210, no additional data needs to be acquired to determine the adjustment factor. Hard sample mining can thus be achieved in a simple manner without increasing the amount of computation.

According to an embodiment of the present disclosure, determining the adjustment factor according to the sample bounding box and the position label may include the following operations.

Calculating the intersection over union (IoU) of the sample bounding box and the position label; and determining the adjustment factor according to the IoU.

It can be understood that the larger the IoU of the sample bounding box and the position label, the easier the sample image is to regress, and the smaller the IoU, the harder the sample image is to regress. The regression difficulty of the sample image can therefore be measured by the IoU of the sample bounding box and the position label. On this basis, the adjustment factor assigned to each sample image can be determined from the IoU.

As described above, the first regression loss value is corrected by the adjustment factor according to the regression difficulty of the sample image, so as to give a smaller weight to easy-to-regress samples and a larger weight to hard-to-regress samples. Accordingly, a smaller adjustment factor may be assigned to easy-to-regress samples, and a larger adjustment factor to hard-to-regress samples. The adjustment factor is therefore negatively correlated with the IoU. Based on this relation, the regression loss of each sample image can be adjusted adaptively, so that the model can concentrate its optimization on hard-to-regress samples without increasing the amount of computation, which further improves the accuracy of the regression branch's predictions.

In order to more accurately acquire the adjustment factor, the adjustment factor may be calculated using the following formula (1).

In formula (1), w represents the adjustment factor, l represents the position label, and γ represents a hyper-parameter; the remaining symbols represent the sample bounding box output by the model and the intersection over union of the position label and the sample bounding box.

In an example, the hyper-parameter γ may be, for example, a value between 0.8 and 1.2, and may be specifically set according to an actual situation.
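To make the relation between IoU and the adjustment factor concrete, the sketch below uses a hypothetical factor w = (1 - IoU)^γ. This form is an assumption chosen only because it is negatively correlated with IoU as the text requires; it should not be read as formula (1) of the disclosure. The (x1, y1, x2, y2) box format and the function names are likewise illustrative.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def adjustment_factor(sample_box: Box, position_label: Box, gamma: float = 1.0) -> float:
    """Hypothetical factor (1 - IoU) ** gamma: larger when the IoU is smaller."""
    return (1.0 - iou(sample_box, position_label)) ** gamma


label = (0.0, 0.0, 2.0, 2.0)
print(adjustment_factor((0.0, 0.0, 2.0, 2.0), label))  # perfect overlap
print(adjustment_factor((1.0, 0.0, 3.0, 2.0), label))  # partial overlap
print(adjustment_factor((5.0, 5.0, 6.0, 6.0), label))  # no overlap
```

Under this assumed form, a γ in the 0.8 to 1.2 range mentioned above only mildly reshapes the curve: values below 1 raise the weight of well-localized samples slightly, and values above 1 lower it.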

Since the classification branch and the regression branch do not affect each other, in some embodiments the hard sample mining strategy for the regression branch may be used in combination with a hard sample mining strategy for the classification branch.

According to the embodiment of the disclosure, when the classification branch adopts a hard sample mining strategy, the classification loss value may be determined from the sample category and the category label by, for example, the following operation.

A classification loss value is determined based on a Focal loss function according to the sample category and the category label.

The Focal loss function is based on the binary cross-entropy loss. It is a dynamically scaled cross-entropy loss: a dynamic scaling factor reduces the weight of easy samples during training, so that the loss function pays more attention to hard samples and the imbalance between positive and negative samples is mitigated.

The Focal loss function may be calculated using the following equation (2).

In formula (2), L_cls denotes the Focal loss function, α and η denote hyper-parameters, and y represents the category label; the remaining symbol represents the sample category output by the model.

In the embodiment of the disclosure, a hard sample mining strategy is adopted for both the classification branch and the regression branch, so that the model optimizes hard samples in both branches, which further improves the prediction performance of the model without increasing the amount of computation.
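The dynamic scaling behavior of a Focal loss can be sketched for a single binary prediction. The code uses the commonly used binary form with a balancing weight α and a focusing exponent η; whether formula (2) of the disclosure has exactly this form is an assumption, and the default values 0.25 and 2.0 are common choices rather than values taken from the text.

```python
import math


def focal_loss(p: float, y: int, alpha: float = 0.25, eta: float = 2.0) -> float:
    """Binary focal loss for predicted probability p and label y in {0, 1}."""
    eps = 1e-7
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
    if y == 1:
        # (1 - p) ** eta shrinks the loss of confidently correct positives.
        return -alpha * (1.0 - p) ** eta * math.log(p)
    # p ** eta shrinks the loss of confidently correct negatives.
    return -(1.0 - alpha) * p ** eta * math.log(1.0 - p)


print(focal_loss(0.9, 1))  # easy positive: small loss
print(focal_loss(0.1, 1))  # hard positive: much larger loss
```

Setting η to 0 removes the scaling factor and leaves an α-weighted cross-entropy, which makes the role of the dynamic scaling easy to check.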

According to an embodiment of the present disclosure, a deep learning model may include, for example, a feature extraction module, a feature fusion module, and a target detection module. In one example, the feature extraction module may be configured to perform a feature extraction operation on the sample image, resulting in a plurality of multi-scale feature maps. The feature fusion module can be used for carrying out multi-scale fusion processing on the multiple multi-scale feature maps to obtain the multiple multi-scale fusion feature maps. The target detection module may be configured to determine a sample class and a sample bounding box of the target object in the sample image based on the plurality of multi-scale fused feature maps. In the above example, the process of performing the multi-scale fusion processing on the plurality of multi-scale feature maps using the feature fusion module may be the same as the process described above, but the present disclosure is not limited thereto.

According to an embodiment of the present disclosure, operation S240 may include the following operations.

Determining a joint loss value according to the classification loss value and the second regression loss value; and adjusting parameters of the feature extraction module, the feature fusion module and the target detection module according to the joint loss value.

According to an embodiment of the present disclosure, the classification loss value may be determined based on a Focal loss function, or may be determined in other ways, which is not limited in particular.

It can be understood that, since the classification branch and the regression branch do not affect each other, a preset weight can be given to the loss value of each branch according to actual conditions, so as to determine the proportion of the classification loss value and the second regression loss value in the joint loss value, thereby adjusting the parameters of the model more accurately.
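The weighting of the two branches can be sketched as a weighted sum. The weighted-sum form, the function name, and the example weights are assumptions for illustration; the text only states that preset weights determine the proportion of each loss in the joint loss value.

```python
def joint_loss(classification_loss: float,
               second_regression_loss: float,
               cls_weight: float = 1.0,
               reg_weight: float = 1.0) -> float:
    """Weighted sum of the classification loss and the corrected regression loss."""
    return cls_weight * classification_loss + reg_weight * second_regression_loss


# Hypothetical values: halve the contribution of the regression branch.
print(joint_loss(0.4, 1.0, cls_weight=1.0, reg_weight=0.5))
```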

In the embodiment of the disclosure, the parameters of the feature extraction module, the feature fusion module, and the target detection module are adjusted by using the joint loss value, so that the model can at least concentrate its optimization on hard-to-regress samples, thereby improving the accuracy and detection efficiency of the regression branch's predictions.

Fig. 3 is a schematic diagram of a training method of a deep learning model according to an embodiment of the present disclosure. The scheme of the present disclosure will be explained below with reference to fig. 3.

As shown in fig. 3, the deep learning model 300 includes a feature extraction module 310, a feature fusion module 320, and an object detection module 330. The following describes the scheme of the present disclosure in detail by taking an example of training the deep learning model 300 by using the sample image R. Wherein the sample image R may comprise at least one target object and a category label y and a location label l of the at least one target object.

The sample image R is input into the feature extraction module 310 to perform a feature extraction operation on the sample image R, resulting in a plurality of multi-scale feature maps Fr. And performing multi-scale fusion processing on the multiple multi-scale feature maps Fr by using the feature fusion module 320 to obtain multiple multi-scale fusion feature maps Fe. And detecting the multiple multi-scale fusion feature maps Fe by using a target detection module 330 to obtain a sample type Sc and a sample frame Sb of the target object in the sample image R.

A classification loss value Lc is calculated (301) from the sample category Sc and the category label y. A second regression loss value Lr is calculated (302) from the sample frame Sb, the adjustment factor w, and the position label l, where the adjustment factor w is determined based on the sample frame Sb and the position label l. In one example, the adjustment factor w may be calculated according to equation (1).
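The computation of the adjustment factor and the second regression loss value can be illustrated with a minimal sketch. Equation (1) is not reproduced in this passage, so the sketch assumes the simplest factor that is inversely related to the intersection over union (IoU), namely w = 1 - IoU. The L1 form of the first regression loss, the multiplicative correction, and all function names are likewise assumptions made for illustration only.

```python
from typing import Sequence

Box = Sequence[float]  # (x1, y1, x2, y2), with x1 <= x2 and y1 <= y2


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def adjustment_factor(sample_frame: Box, position_label: Box) -> float:
    # ASSUMPTION: w = 1 - IoU stands in for equation (1); it only
    # preserves the stated inverse relation between w and the IoU.
    return 1.0 - iou(sample_frame, position_label)


def first_regression_loss(sample_frame: Box, position_label: Box) -> float:
    # ASSUMPTION: mean absolute coordinate error (L1) as the base loss.
    return sum(abs(p - t) for p, t in zip(sample_frame, position_label)) / 4.0


def second_regression_loss(sample_frame: Box, position_label: Box) -> float:
    # ASSUMPTION: the correction multiplies the first loss by w, so poorly
    # localized (hard) samples contribute more to the regression loss.
    w = adjustment_factor(sample_frame, position_label)
    return w * first_regression_loss(sample_frame, position_label)
```

Under these assumptions, a sample frame that coincides with its position label receives w = 0, while a sample frame that does not overlap the label at all receives w = 1.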

A joint loss value Lu is calculated (303) from the classification loss value Lc and the second regression loss value Lr. The parameters of the feature extraction module 310, the feature fusion module 320, and the target detection module 330 are adjusted using the joint loss value Lu.

In some embodiments, the classification loss value Lc may be determined using a Focal loss function, according to the sample category Sc and the category label y. In this way, hard samples are emphasized in both the classification branch and the regression branch, which further improves the prediction performance of the model without increasing the amount of computation.
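For reference, the binary form of the Focal loss for a single prediction can be sketched as follows. The function name and the default values of `alpha` and `gamma` are assumptions (commonly used settings, not values stated in this disclosure), and a real detector would apply the loss over all candidate positions and categories.

```python
import math


def focal_loss(p: float, y: int, alpha: float = 0.25, gamma: float = 2.0) -> float:
    """Binary focal loss: -alpha_t * (1 - p_t) ** gamma * log(p_t).

    p is the predicted probability of the positive category and y is the
    category label (1 for positive, 0 for negative).
    """
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

The factor (1 - p_t) ** gamma shrinks the loss of well-classified samples, so confidently correct predictions contribute little and hard samples dominate the classification loss.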

According to the embodiment of the disclosure, the deep learning model can be trained by using a plurality of batches of sample images until the model converges. The process of training the model using each sample image is the same as or similar to the process described above and will not be described in detail here. The trained deep learning model may be used for detection of a target object, and a target object detection method will be described below with reference to fig. 4.

Fig. 4 is a flowchart of a target object detection method according to an embodiment of the present disclosure.

As shown in fig. 4, the target object detection method 400 includes operations S410 to S430.

In operation S410, an image to be detected is acquired.

According to an embodiment of the present disclosure, the image to be detected may be any one or more frames of images in a video stream acquired by a camera, or may be acquired in other manners, which is not limited in the present disclosure.

The image to be detected may include one or more target objects. The target object may be, for example, a face, a vehicle, or another object, which is not limited herein.

In operation S420, a deep learning model is acquired.

According to the embodiment of the present disclosure, the deep learning model referred to herein is trained based on the training method of the deep learning model described in any one of the above embodiments.

In operation S430, the image to be detected is input into the deep learning model, so as to obtain category information and positioning information of the target object in the image to be detected.

According to an embodiment of the present disclosure, the category information of the target object indicates the category to which the target object in the image to be detected belongs. The positioning information of the target object indicates the position of the target object in the image to be detected. Therefore, the target object in the image to be detected can be accurately detected.

In the scheme of the embodiment of the disclosure, since the deep learning model trained by the above method focuses its optimization on hard samples, detecting the image to be detected with this deep learning model can improve the accuracy and efficiency of target object detection.

It should be noted that operations S410 and S420 may be executed in parallel. However, the embodiments of the present disclosure are not limited thereto, and the two operations may be performed in other orders, for example, first performing operation S420 and then performing operation S410.
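Because operations S410 and S420 do not depend on each other, they can be dispatched concurrently and joined before operation S430. The following sketch uses a thread pool for this; `acquire_image`, `acquire_model`, and the returned placeholder values are hypothetical stand-ins, not part of the disclosure.

```python
from concurrent.futures import ThreadPoolExecutor


def acquire_image():
    # Hypothetical stand-in for operation S410 (e.g. reading a video frame).
    return [[0, 0], [0, 0]]


def acquire_model():
    # Hypothetical stand-in for operation S420 (e.g. loading trained weights).
    # The returned callable mimics a model that outputs category information
    # and positioning information.
    return lambda image: {"category": "placeholder", "box": (0, 0, 1, 1)}


def detect():
    # S410 and S420 run concurrently; S430 starts once both have finished.
    with ThreadPoolExecutor(max_workers=2) as pool:
        image_future = pool.submit(acquire_image)
        model_future = pool.submit(acquire_model)
        image = image_future.result()
        model = model_future.result()
    return model(image)  # operation S430
```

Executing the two acquisitions sequentially in either order yields the same result; the parallel form only shortens the wall-clock time when both operations are slow.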

Fig. 5 is a block diagram of a training apparatus for a deep learning model according to an embodiment of the present disclosure. As shown in fig. 5, the training apparatus 500 for a deep learning model includes a first determining module 510, a calculation module 520, a correction module 530, and an adjusting module 540.

The first determining module 510 is configured to determine a sample class and a sample border of a target object in a sample image according to the sample image. The sample image includes a label of the target object.

The calculation module 520 is configured to determine a classification loss value and a first regression loss value according to the sample category, the sample border, and the label.

The correction module 530 is configured to correct the first regression loss value by using the adjustment factor to obtain a second regression loss value. The adjustment factor indicates the degree of regression difficulty of the sample image.

The adjusting module 540 is configured to adjust parameters of the deep learning model according to the classification loss value and the second regression loss value.

According to an embodiment of the present disclosure, the label includes a category label and a position label. The training apparatus 500 for a deep learning model further includes a second determining module. The second determining module is configured to determine the adjustment factor according to the sample frame and the position label.

According to an embodiment of the present disclosure, the second determining module includes a first calculation unit and a determining unit. The first calculation unit is configured to calculate the intersection over union (IoU) of the sample frame and the position label; and the determining unit is configured to determine the adjustment factor according to the IoU.

According to an embodiment of the present disclosure, the adjustment factor is inversely related to the intersection over union (IoU).

According to an embodiment of the present disclosure, the calculation module includes a second calculation unit and a third calculation unit. The second calculation unit is configured to determine the classification loss value according to the sample category and the category label; and the third calculation unit is configured to determine the first regression loss value according to the sample frame and the position label.

According to an embodiment of the present disclosure, the second calculation unit includes a calculation subunit. The calculation subunit is configured to determine a classification loss value based on a Focal loss function according to the sample class and the class label.

According to an embodiment of the present disclosure, the first determining module includes a feature extraction unit, a fusion unit, and a detection unit. The feature extraction unit is configured to perform a feature extraction operation on the sample image to obtain a plurality of multi-scale feature maps; the fusion unit is configured to perform multi-scale fusion processing on the plurality of multi-scale feature maps to obtain a plurality of multi-scale fusion feature maps; and the detection unit is configured to determine the sample category and the sample frame of the target object in the sample image according to the plurality of multi-scale fusion feature maps.

According to an embodiment of the disclosure, a deep learning model comprises a feature extraction module, a feature fusion module and a target detection module; the adjusting module comprises a fourth calculating unit and an adjusting unit. The fourth calculating unit is used for determining a joint loss value according to the classification loss value and the second regression loss value; and the adjusting unit is used for adjusting parameters of the feature extraction module, the feature fusion module and the target detection module according to the joint loss value.

Fig. 6 is a block diagram of a target object detection apparatus according to an embodiment of the present disclosure.

As shown in fig. 6, the target object detection apparatus 600 includes a first obtaining module 610, a second obtaining module 620, and a detection module 630.

The first obtaining module 610 is used for obtaining an image to be detected.

The second obtaining module 620 is used for obtaining a deep learning model. The deep learning model is obtained by training by using the training device of the deep learning model in any one of the above embodiments.

The detection module 630 is configured to input the image to be detected into the deep learning model, so as to obtain category information and positioning information of the target object in the image to be detected.

It should be noted that the implementation, solved technical problems, implemented functions, and achieved technical effects of each module/unit/subunit and the like in the apparatus part embodiment are respectively the same as or similar to the implementation, solved technical problems, implemented functions, and achieved technical effects of each corresponding step in the method part embodiment, and are not described herein again.

In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and other handling of the personal information of the users involved all comply with relevant laws and regulations and do not violate public order and good customs.

In the technical scheme of the disclosure, before the personal information of the user is acquired or collected, the authorization or the consent of the user is acquired.

The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.

According to an embodiment of the present disclosure, an electronic device includes: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method according to an embodiment of the disclosure.

According to an embodiment of the present disclosure, a non-transitory computer-readable storage medium stores computer instructions for causing a computer to perform a method according to an embodiment of the present disclosure.

According to an embodiment of the disclosure, a computer program product includes a computer program which, when executed by a processor, implements a method according to an embodiment of the disclosure.

FIG. 7 illustrates a schematic block diagram of an example electronic device 700 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital processors, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 7, the device 700 comprises a computing unit 701, which may perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM)702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.

Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.

The computing unit 701 may be any of a variety of general purpose and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 701 performs the respective methods and processes described above, such as the training method of the deep learning model and the target object detection method. For example, in some embodiments, the training method of the deep learning model and the target object detection method may be implemented as computer software programs that are tangibly embodied on a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of a computer program may be loaded onto and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the training method of the deep learning model and the target object detection method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured in any other suitable way (e.g., by means of firmware) to perform the training method of the deep learning model and the target object detection method.

Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), system on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.

The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims

1. A training method of a deep learning model comprises the following steps:

determining a sample category and a sample frame of a target object in a sample image according to the sample image; the sample image includes a label of the target object;

determining a classification loss value and a first regression loss value according to the sample category, the sample frame and the label;

correcting the first regression loss value by using an adjustment factor to obtain a second regression loss value, wherein the adjustment factor indicates the degree of regression difficulty of the sample image; and

adjusting parameters of the deep learning model according to the classification loss value and the second regression loss value.

2. The method of claim 1, wherein the label includes a category label and a position label; the method further comprises:

determining the adjustment factor according to the sample frame and the position label.

3. The method of claim 2, wherein the determining the adjustment factor according to the sample frame and the position label comprises:

calculating an intersection over union (IoU) of the sample frame and the position label; and

determining the adjustment factor according to the IoU.

4. The method of claim 3, wherein the adjustment factor is inversely related to the intersection over union (IoU).

5. The method of claim 2, wherein the determining a classification loss value and a first regression loss value according to the sample category, the sample frame, and the label comprises:

determining the classification loss value according to the sample category and the category label; and

determining the first regression loss value according to the sample frame and the position label.

6. The method of claim 5, wherein the determining the classification loss value from the sample class and the class label comprises:

determining the classification loss value based on a Focal loss function according to the sample class and the class label.

7. The method of claim 1, wherein the determining, according to the sample image, a sample category and a sample frame of the target object in the sample image comprises:

performing feature extraction operation on the sample image to obtain a plurality of multi-scale feature maps;

performing multi-scale fusion processing on the plurality of multi-scale feature maps to obtain a plurality of multi-scale fusion feature maps; and

determining the sample category and the sample frame of the target object in the sample image according to the plurality of multi-scale fusion feature maps.

8. The method of claim 1, wherein the deep learning model comprises a feature extraction module, a feature fusion module, and a target detection module; the adjusting parameters of the deep learning model according to the classification loss value and the second regression loss value comprises:

determining a joint loss value according to the classification loss value and the second regression loss value; and

adjusting parameters of the feature extraction module, the feature fusion module and the target detection module according to the joint loss value.

9. A target object detection method, comprising:

inputting an image to be detected into a deep learning model to obtain the category information and the positioning information of a target object in the image to be detected,

wherein the deep learning model is obtained by training by using the method of any one of claims 1-8.

10. A training apparatus for deep learning models, comprising:

the first determination module is used for determining the sample category and the sample frame of the target object in the sample image according to the sample image; the sample image includes a label of the target object;

the calculation module is used for determining a classification loss value and a first regression loss value according to the sample category, the sample border and the label;

the correction module is used for correcting the first regression loss value by using an adjustment factor to obtain a second regression loss value, wherein the adjustment factor indicates the degree of regression difficulty of the sample image; and

the adjusting module is used for adjusting the parameters of the deep learning model according to the classification loss value and the second regression loss value.

11. The apparatus of claim 10, wherein the label comprises a category label and a position label; the apparatus further comprises:

the second determining module is used for determining the adjustment factor according to the sample frame and the position label.

12. The apparatus of claim 11, wherein the second determining module comprises:

the first calculation unit is used for calculating an intersection over union (IoU) of the sample frame and the position label; and

the determining unit is used for determining the adjustment factor according to the IoU.

13. The apparatus of claim 12, wherein the adjustment factor is inversely related to the intersection over union (IoU).

14. The apparatus of claim 11, wherein the calculation module comprises:

a second calculating unit, configured to determine the classification loss value according to the sample category and the category label; and

a third calculating unit, configured to determine the first regression loss value according to the sample frame and the position label.

15. The apparatus of claim 14, wherein the second computing unit comprises:

a calculating subunit, configured to determine the classification loss value based on a Focal loss function according to the sample class and the class label.

16. The apparatus of claim 10, wherein the first determination module comprises:

the feature extraction unit is used for performing a feature extraction operation on the sample image to obtain a plurality of multi-scale feature maps;

the fusion unit is used for performing multi-scale fusion processing on the plurality of multi-scale feature maps to obtain a plurality of multi-scale fusion feature maps; and

the detection unit is used for determining the sample category and the sample frame of the target object in the sample image according to the plurality of multi-scale fusion feature maps.

17. The apparatus of claim 10, wherein the deep learning model comprises a feature extraction module, a feature fusion module, and a target detection module; the adjustment module includes:

a fourth calculating unit, configured to determine a joint loss value according to the classification loss value and the second regression loss value; and

an adjusting unit, configured to adjust parameters of the feature extraction module, the feature fusion module and the target detection module according to the joint loss value.

18. A target object detection apparatus comprising:

the detection module is used for inputting the image to be detected into the deep learning model to obtain the category information and the positioning information of the target object in the image to be detected,

wherein the deep learning model is obtained by training by using the device of any one of claims 10-17.

19. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.

20. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-9.

21. A computer program product comprising a computer program which, when executed by a processor, implements a method according to any one of claims 1 to 9.