CN112036457A - Method and device for training target detection model and target detection method and device - Google Patents

Method and device for training target detection model and target detection method and device

Info

Publication number
CN112036457A
Authority
CN
China
Prior art keywords
detection model
model
target detection
target
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010843340.6A
Other languages
Chinese (zh)
Inventor
黄超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010843340.6A
Publication of CN112036457A
Legal status: Pending

Classifications

    • G06F 18/214 — Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/241 — Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N 3/045 — Neural network architectures; combinations of networks
    • G06V 10/25 — Image preprocessing; determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 2201/07 — Indexing scheme relating to image or video recognition or understanding; target detection

Abstract

The disclosure relates to a method and a device for training a target detection model based on artificial intelligence, and a target detection method and a target detection device. The method for training the target detection model comprises the following steps: acquiring an image sample set; marking one or more targets in the image samples in the image sample set; training a reference model using the annotated set of image samples; constructing a target detection model, wherein the number of nodes contained in the target detection model is less than that of nodes contained in a reference model; and training a target detection model using the annotated set of image samples with reference to the attention features of the trained reference model. The method for detecting the target in the image comprises the following steps: the target detection model trained by the method for training the target detection model is used for detecting the target in the image.

Description

Method and device for training target detection model and target detection method and device
Technical Field
The present disclosure relates generally to the field of machine vision, and more particularly, to a method and apparatus for training a target detection model based on artificial intelligence, and a target detection method and apparatus.
Background
Object detection, the task of which is to find objects or regions of interest (ROIs) in an image, has made significant progress in recent years. In automatic game testing, object detection plays an important role and is the basis of automatic detection. At the same time, because various targets have different appearances, shapes and postures, and because of interference from factors such as illumination or occlusion during imaging, target detection has always been a challenging task in the field of computer vision.
In existing technical solutions, there are various target detection methods, such as target detection schemes based on You Only Look Once (YOLO), target detection schemes based on cascaded Region-based Convolutional Neural Networks (RCNN), and the like. However, the models trained by these methods contain a large number of nodes, which results in high model complexity, a large number of model parameters and a slow inference speed, making them unsuitable for on-line deployment.
Disclosure of Invention
In view of this, embodiments of the present disclosure provide a method and an apparatus for training a target detection model based on artificial intelligence, and a target detection method and an apparatus.
According to a first aspect of the present disclosure, there is provided a method of training a target detection model, comprising: acquiring an image sample set; annotating one or more targets in the image samples in the image sample set; training a reference model using the annotated set of image samples; constructing a target detection model, wherein the number of nodes contained in the target detection model is smaller than the number of nodes contained in the reference model; and training the target detection model using the annotated set of image samples with reference to the attention features of the trained reference model. Here, the number of nodes contained in a model is used to represent the scale of the model. Generally speaking, the more nodes a model contains, the more complex the model is, the more model parameters it contains, and the slower its target detection speed; conversely, the fewer nodes a model contains, the more lightweight the model is, the fewer model parameters it contains, and the faster its target detection speed. In some embodiments, the target detection model contains a smaller number of convolutional layers than the reference model, and/or the target detection model contains a smaller number of convolution channels than the reference model, so that the number of nodes contained in the target detection model is smaller than the number of nodes contained in the reference model.
According to a second aspect of the present disclosure, there is provided a method of detecting a target in an image, comprising: the target detection model trained by the above method is used to detect the target in the image.
According to a third aspect of the present disclosure, there is provided an apparatus for training a target detection model, comprising: an image sample acquisition module configured to acquire a set of image samples; a target annotation module configured to annotate one or more targets in an image sample of a set of image samples; a model training module configured to: training a reference model using the annotated set of image samples; constructing a target detection model, wherein the number of nodes contained in the target detection model is less than that of nodes contained in a reference model; and training a target detection model using the annotated set of image samples with reference to the attention features of the trained reference model.
In some embodiments, the image sample acquisition module is further configured to acquire the set of image samples by: obtaining a plurality of image samples containing a target from an image set; deleting image samples of which the proportion of the target area to the image sample area is smaller than a threshold value from the acquired plurality of image samples; and deleting image samples with similarity higher than a threshold value from the acquired plurality of image samples.
In some embodiments, the target annotation module is further configured to annotate the one or more targets in the image sample by: one or more target boxes are provided in the image sample for labeling an x-coordinate of a center position of the one or more targets, a y-coordinate, a width of the one or more targets, and a height of the one or more targets. The target annotation module is further configured to annotate one or more targets in the image sample by: the categories of the one or more targets are labeled.
In some embodiments, the model training module is further configured to train the reference model by: scaling the image samples in the labeled image sample set so that the image samples are consistent in size; setting a loss function of the reference model; and iteratively updating parameters of the reference model in a manner that gradually reduces a penalty function of the reference model.
In some embodiments, the reference model predicts the class and location of the target by three different scales of convolved signatures; the loss function includes a target classification loss and a target position offset loss, and the target classification loss and the target position offset loss have different weights; the iteration is performed by means of a gradient backward pass.
In some embodiments, the target detection model contains a smaller number of convolutional layers than the reference model; and/or the target detection model comprises a smaller number of convolution channels than the reference model.
In some embodiments, the loss function of the object detection model comprises a loss of attention between attention features of the trained reference model and attention features of the object detection model, wherein the model training module is further configured to train the object detection model with reference to the trained reference model by: calculating attention features of the trained reference model; iteratively performing the following steps to gradually reduce a loss function of the target detection model: calculating attention characteristics of the target detection model; calculating an attention loss of the object detection model based on Euclidean distances between the attention features of the reference model and the attention features of the object detection model; and updating parameters of the object detection model based on the calculated attention loss of the object detection model.
In some embodiments, the model training module is further configured to calculate the attention feature of the trained reference model by: accumulating absolute values of convolution features at the same position in the convolution feature spectrum of the trained reference model; and normalizing the accumulated result.
In some embodiments, the convolved signature includes three different scales of convolved signatures; the attention feature of the reference model and the attention feature of the target detection model are attention features calculated through convolution feature spectrums of three different scales; the loss function of the object detection model further includes an object classification loss and an object position offset loss, and the attention loss, the object classification loss and the object position offset loss have different weights.
According to a fourth aspect of the present disclosure, there is provided an apparatus for detecting an object in an image, comprising: a target detection module configured to use the target detection model trained by the above method to detect the target in the image.
According to a fifth aspect of the present disclosure, there is provided a computing device comprising a processor and a memory, the memory having stored thereon instructions that, when executed on the processor, cause the processor to perform the above-described method of training an object detection model or method of detecting an object in an image.
According to a sixth aspect of the present disclosure, there is provided one or more computer-readable storage media having instructions stored thereon, which when executed on one or more processors, cause the one or more processors to perform the above-described method of training an object detection model or method of detecting an object in an image.
According to the technical solution of the embodiments of the present disclosure, a target detection model containing fewer nodes is trained with reference to a trained reference model containing more nodes, so that knowledge can be learned from the reference model containing more nodes and the detection performance of the target detection model is improved. Furthermore, because the target detection model is trained with reference to the attention features of the trained reference model, the number of channels of the feature spectrum of the target detection model is not required to match the number of channels of the reference model; a target detection model with fewer nodes can therefore be constructed, which facilitates further lightweighting of the target detection model, making it more suitable for applications with higher real-time requirements and for on-line deployment. In addition, because the number of nodes of the target detection model is reduced, the embodiments of the present disclosure can also accelerate the inference speed of the target detection model and improve the training efficiency of the target detection model.
These and other aspects of the disclosure will be apparent from and elucidated with reference to the embodiments described hereinafter.
Drawings
Further details, features and advantages of the disclosure are disclosed in the following description of exemplary embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 schematically illustrates an exemplary object detection system according to an embodiment of the disclosure;
FIG. 2 schematically illustrates a flow diagram of a method of training a target detection model according to an embodiment of the present disclosure;
FIG. 3A schematically shows an example of one image sample of a set of acquired image samples;
FIG. 3B schematically illustrates an example of an annotated image sample;
FIG. 3C schematically shows an example of a scaled image sample;
FIG. 3D schematically shows an example of applying a feature spectrum of scale 4 x 4 to an image sample;
FIG. 4A schematically illustrates a specific example of a reference model according to an embodiment of the present disclosure;
FIG. 4B schematically shows a specific example of an object detection model according to an embodiment of the disclosure;
FIG. 5 schematically illustrates a specific example of a method of training a target detection model based on a reference model according to an embodiment of the present disclosure;
FIG. 6 schematically shows a flow chart of a method of detecting an object in an image according to an embodiment of the present disclosure;
FIG. 7 schematically illustrates an example of a further application scenario according to an embodiment of the present disclosure;
FIG. 8A schematically illustrates a block diagram of an apparatus for training a target detection model according to an embodiment of the present disclosure;
FIG. 8B schematically shows a block diagram of an apparatus for detecting an object in an image according to an embodiment of the present disclosure; and
FIG. 9 schematically shows a block diagram of a computing device in accordance with an embodiment of the disclosure.
Detailed Description
Before describing embodiments of the present disclosure in detail, some related concepts are explained first.
Artificial Intelligence (AI): AI is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. AI base technologies generally include, for example, sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operating/interactive systems, mechatronics, computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer Vision (CV): among the research directions of AI technology, computer vision is the science of how to make a machine "see". More specifically, it means using cameras and computers instead of human eyes to recognize, track and measure targets, and to perform further image processing so that the processed image is more suitable for human observation or for transmission to an instrument for detection. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, Optical Character Recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, and further include biometric technologies such as face recognition and fingerprint recognition.
Attention: a very common but easily overlooked phenomenon. For example, when a bird flies across the sky, human attention tends to follow the bird, and the sky naturally becomes background information in the human visual system. The basic idea of the attention mechanism in computer vision is to let the system learn to focus on places of interest, ignoring background information and concentrating on important information.
Loss Function: also called cost function or objective function, it measures how inconsistent the predicted value f(x) of a model is with the true value Y, and is usually a non-negative real-valued function denoted L(Y, f(x)). In general, the smaller the value of the loss function (i.e., the loss value), the better the model fits and the stronger its predictive power for new data. In deep learning, the loss function acts as the "baton" of model training: it guides model parameter learning by back-propagating the error between the predictions for samples and the true sample labels. When the loss value of the loss function gradually decreases (converges), the model training can be considered complete.
FIG. 1 schematically illustrates an exemplary object detection system 100 in which various methods described herein may be implemented, according to an embodiment of the disclosure. As shown in FIG. 1, the object detection system 100 includes an object detection model training server 110, one or more electronic devices 130, and optionally also a network 120.
The object detection model training server 110 stores and executes instructions that may, for example, perform the various methods of training an object detection model described herein; it may be a single server or a cluster of servers. It should be understood that the servers referred to herein are typically server computers having a large amount of memory and processor resources, but other embodiments are possible.
Examples of network 120 include a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), and/or a combination of communication networks such as the Internet. The object detection model training server 110 may include at least one communication interface (not shown) capable of communicating over the network 120. Such a communication interface may be one or more of the following: any type of network interface (e.g., a Network Interface Card (NIC)), a wired or wireless (such as IEEE 802.11 Wireless LAN (WLAN)) interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth™ interface, a Near Field Communication (NFC) interface, etc.
The electronic device 130 may be any type of mobile computing device, including a mobile computer or mobile computing device (e.g., a Microsoft® Surface® device, a Personal Digital Assistant (PDA), a laptop computer, a notebook computer, a tablet computer such as the Apple iPad™, a netbook, etc.), a mobile phone (e.g., a cellular phone, a smartphone such as a Microsoft Windows® phone or an Apple iPhone, a phone implementing the Google® Android™ operating system, a Palm® device, a Blackberry® device, etc.), a wearable computing device (e.g., a smart watch, a head-mounted device including smart glasses such as Google® Glass™, etc.), or another type of mobile device. The electronic device 130 may also be any type of stationary computing device, such as a desktop computer. Further, the plurality of electronic devices 130 may be the same or different types of computing devices.
The electronic device 130 may include a display screen 131 and an application 132 that may interact with a user via the display screen 131. The electronic device 130 may interact with, e.g., send data to or receive data from, the object detection model training server 110, e.g., via the network 120. The application 132 may be a native application, a Web page (Web) application, or an applet (LiteApp) that is a lightweight application. In the case where the application 132 is a native application that needs to be installed, the application 132 may be installed in the electronic device 130. In the case where application 132 is a Web application, application 132 may be accessed through a browser. In the case that the application 132 is an applet, the application 132 can be directly opened on the electronic device 130 by searching relevant information of the application 132 (such as a name of the application 132, etc.), scanning a graphic code of the application 132 (such as a barcode, a two-dimensional code, etc.), and the like, without installing the application 132.
It should be understood that although the object detection model training server 110 and the electronic device 130 are shown and described herein as separate structures, they may be different components of the same computing device. For example, the target detection model training server 110 may provide background computing functionality, while the electronic device 130 may provide foreground display functionality.
FIG. 2 schematically shows a flow diagram of a method 200 of training a target detection model according to an embodiment of the disclosure. The method 200 for training the object detection model may be implemented on the object detection model training server 110, or may be implemented on the electronic device 130. For example, in one embodiment, the method of training the object detection model is implemented on the object detection model training server 110 and sent to the electronic device 130 after the object detection model training is completed in order to detect the object in the image. In another embodiment, the method of training the object detection model is implemented on the electronic device 130 and the objects in the detection image are implemented locally, in which case the object detection model training server 110 and the network 120 shown in FIG. 1 are not required. In yet another embodiment, the method of training the object detection model is implemented on one electronic device 130 and sent to another electronic device 130 after the object detection model training is completed in order to detect the object in the image. In further embodiments, the method 200 of training the object detection model may also be performed by the object detection model training server 110 in combination with the electronic device 130. The training of the target detection model and the implementation location of target detection are not exhaustive here.
The following description is made with reference to FIGS. 3A to 3D, FIGS. 4A to 4B, and FIG. 5, taking as an example the method 200 being executed on the target detection model training server 110 side, and taking the training of a target detection model for a gunfight game as an example. FIGS. 3A-3D schematically illustrate examples of operations performed on the image sample set. FIGS. 4A-4B schematically show specific examples of a reference model and a target detection model, respectively, according to an embodiment of the disclosure. FIG. 5 schematically illustrates a specific example of a method 500 of training a target detection model based on a reference model according to an embodiment of the present disclosure.
It should be understood that the step numbering used herein is merely used to distinguish between different steps and is not intended to indicate that the steps need to be performed in the order numbered. Rather, steps may be performed in an order different than the numbered order, some steps may be performed in parallel, and some steps may be performed repeatedly. It should also be understood that the interfaces shown in fig. 3A-3D may be interfaces of applications 132 displayed on display screen 131 of electronic device 130.
The method 200 may provide a configuration interface that supports interaction for the purpose of configuring parameters. For example, an entity or individual, including but not limited to a target detection model vendor, may configure various parameters through the configuration interface. In some embodiments, the configuration interface may be implemented as an interactive interface that includes options or input boxes.
At step 201, the target detection model training server 110 may obtain a set of image samples. In some embodiments, acquiring the set of image samples further comprises: obtaining a plurality of image samples containing a target from an image set; deleting image samples of which the proportion of the target area to the image sample area is smaller than a threshold value from the acquired plurality of image samples; and deleting image samples with similarity higher than a threshold value from the acquired plurality of image samples.
In some embodiments, image samples of a gunfight game may be obtained by capturing frames from a recorded video of the game, or by capturing frames during gameplay. For example, an image sample may be captured from the game recording video every 2 seconds. Game images containing a virtual game character may then be selected from the captured image samples. For example, if the area of the virtual game character is less than 1/400 of the total area of the image sample, the sample is deleted. In addition, if the similarity between two or more image samples is high (e.g., up to 90%), some of them may also be pruned, the goal being to prevent model overfitting.
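As an illustration of the filtering described above, the following sketch shows one possible way to keep only samples whose target is sufficiently large and to prune near-duplicate frames. The OpenCV-based histogram similarity measure and all function names here are assumptions for illustration, not part of the disclosure; the 1/400 area ratio and 90% similarity thresholds follow the example values given above.

```python
import cv2

AREA_RATIO_THRESHOLD = 1.0 / 400   # example threshold from the text
SIMILARITY_THRESHOLD = 0.90        # example similarity from the text

def target_large_enough(target_w, target_h, image_w, image_h):
    """Keep a sample only if the target area is not too small relative to the image area."""
    return (target_w * target_h) / float(image_w * image_h) >= AREA_RATIO_THRESHOLD

def histogram_similarity(img_a, img_b):
    """Rough frame similarity via correlation of normalized gray-level histograms (an assumed measure)."""
    hists = []
    for img in (img_a, img_b):
        hist = cv2.calcHist([cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)], [0], None, [64], [0, 256])
        cv2.normalize(hist, hist)
        hists.append(hist)
    return cv2.compareHist(hists[0], hists[1], cv2.HISTCMP_CORREL)

def deduplicate(frames):
    """Drop frames that are too similar to an already kept frame, to help prevent overfitting."""
    kept = []
    for frame in frames:
        if all(histogram_similarity(frame, k) < SIMILARITY_THRESHOLD for k in kept):
            kept.append(frame)
    return kept
```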
Fig. 3A schematically shows an example of one image sample of the acquired set of image samples. As shown in fig. 3A, in the acquired image sample, information relating to the game as a background is included in addition to the virtual game character as a target. For example, a minimap of the game, targeting information, the life of the virtual game character, armor information, and other information may be included in the image sample.
At step 202, the target detection model training server 110 may annotate one or more targets in image samples in the set of image samples. In some embodiments, annotating one or more targets in an image sample comprises: one or more target boxes are provided in the image sample for labeling an x-coordinate of a center position of the one or more targets, a y-coordinate, a width of the one or more targets, and a height of the one or more targets. For example, after acquiring the set of image samples, the target detection model training server 110 may identify targets in the image samples and label one or more targets. By way of further example, after obtaining the set of image samples, the target detection model training server 110 may send the set of image samples to the one or more electronic devices 130, and then the target detection model training server 110 may receive the user-annotated image samples from the one or more electronic devices 130; at this time, the target detection model training server 110 only needs to read the labeling information in the image sample, and can complete the labeling of one or more targets.
FIG. 3B schematically shows an example of an annotated image sample. As shown in fig. 3B, the x-coordinates and y-coordinates of the center positions of the virtual game characters T1 and T2, i.e., T1(x1, y1) and T2(x2, y2), the widths W1 and W2 of the virtual game characters T1 and T2, and the heights H1 and H2 of the virtual game characters T1 and T2 are labeled in the image sample shown in fig. 3A.
In some embodiments, annotating the one or more targets in the image sample further comprises: labeling the categories of the one or more targets. For example, if the target detection model is intended to detect multiple classes of targets, the classes of one or more targets may be labeled in the image sample. For example, in a gunfight game, if the virtual game character only needs to be distinguished from the background, only the virtual game character needs to be labeled and no additional category information is required; if different virtual game characters are to be further distinguished (e.g., to distinguish between the infiltrator side and the defender side in a game), both categories need to be labeled; and if game props are to be distinguished, the targets may include three categories: protective props, recovery props and shooting props. It should be understood that in other scenarios the targets may belong to other categories, which are not exhaustively listed here. It should also be understood that "labeling" here means making the target, such as its position, size and/or class, known to the machine for machine learning, and the manner of labeling is not exhaustively listed.
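For illustration only, the annotation described above (center coordinates, width, height, and an optional class) could be represented by a simple record such as the following; the dataclass layout, field names and example values are assumptions, not a format prescribed by the disclosure.

```python
from dataclasses import dataclass

@dataclass
class TargetAnnotation:
    x_center: float    # x-coordinate of the target center
    y_center: float    # y-coordinate of the target center
    width: float       # width of the target box
    height: float      # height of the target box
    category: int = 0  # optional class id, used only when several classes are labeled

# Hypothetical annotations for the two virtual game characters T1 and T2 of FIG. 3B.
sample_annotations = [
    TargetAnnotation(x_center=120.0, y_center=210.0, width=40.0, height=95.0),
    TargetAnnotation(x_center=300.0, y_center=180.0, width=38.0, height=90.0),
]
```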
At step 203, the target detection model training server 110 may train a reference model using the annotated set of image samples. For example, a reference model such as a YOLO-based target detection model, a cascaded RCNN-based target detection model, etc. may be trained using the annotated set of image samples. In the target detection model based on the YOLO, firstly, the prior frame scale and the aspect ratio of a target are defined, the depth features of an image are extracted through a convolution layer, a pooling layer and the like, and then, the position and the category of the target are predicted based on the convolution feature spectrums of three scales. In the cascade RCNN-based target detection model, a plurality of Intersection Over Union (IOU) thresholds are set to determine corresponding positive and negative samples, then corresponding target detection modules are trained, and the target detection precision can be improved through cascade connection of the plurality of target detection modules.
Fig. 4A schematically illustrates a specific example of a reference model according to an embodiment of the present disclosure. As shown in fig. 4A, in some embodiments, the reference model is trained using a YOLO V3-based target detection model, specifically darknet53 trained using ImageNet big data. The darknet53 target detection model is a deep network containing 53 convolutional layers. In some embodiments, as shown in fig. 4A, the fully connected layer of darknet53 is removed, using 4 + 1 + 2 × 2 + 1 + 2 × 8 + 1 + 2 × 8 + 1 + 2 × 4 = 52 convolutional layers therein.
In some embodiments, training the reference model further comprises: scaling the image samples in the labeled image sample set so that the image samples are consistent in size; setting a loss function of the reference model; and iteratively updating parameters of the reference model in a manner that gradually reduces a penalty function of the reference model.
In some embodiments, the annotated image samples are scaled to 416 × 416 pixels and the convolved feature spectra are extracted through multiple convolutional layers. FIG. 3C schematically illustrates an example of a scaled image sample; as shown in FIG. 3C, a square portion of the rectangular image sample may be cut out and scaled proportionally. Alternatively, the long edge of the rectangular image sample may be scaled to 416 pixels, and blank image regions may then be filled in on one or both sides of the scaled image sample. In addition, the image samples may also be scaled to 416 × 416 pixels using different scaling ratios for the long and short sides of the rectangular image sample.
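The following sketch illustrates the second scaling option mentioned above (scale the long edge to 416 pixels, then pad the remaining area); the use of OpenCV, the gray padding value and the centering of the padded image are assumptions made for illustration.

```python
import cv2
import numpy as np

def letterbox_to_416(image, size=416, pad_value=128):
    """Scale the long edge of an H x W x 3 image to `size` pixels and pad the rest with a constant value."""
    h, w = image.shape[:2]
    scale = size / float(max(h, w))
    resized = cv2.resize(image, (int(round(w * scale)), int(round(h * scale))))
    canvas = np.full((size, size, 3), pad_value, dtype=resized.dtype)
    top = (size - resized.shape[0]) // 2
    left = (size - resized.shape[1]) // 2
    canvas[top:top + resized.shape[0], left:left + resized.shape[1]] = resized
    return canvas, scale, (left, top)   # offsets are needed to map annotations into the scaled image
```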
In some embodiments, the reference model may predict the class and location of the target by convolving the signature over one or more scales. Fig. 3D schematically shows an example of applying a feature spectrum of scale 4 × 4 to an image sample, which is divided into 4 × 4 squares as shown in fig. 3D, in order to extract the feature spectrum. It should be understood that the 4 x 4 scale of the signature is merely an example, and other scales of the signature may be used in embodiments of the present disclosure. For example, three scales of feature spectra may be used, the sizes of the feature spectra of the first scale, the second scale and the third scale being 13 × 13, 26 × 26 and 52 × 52, respectively, i.e., the image sample of 416 × 416 pixels is divided into 13 × 13, 26 × 26 and 52 × 52 squares, respectively, so that the sizes of each square are 32 × 32 pixels, 16 × 16 pixels and 8 × 8 pixels, respectively. Wherein the first scale 13 x 13 signature contains larger squares for detecting large scale targets; the second scale 26 x 26 feature spectrum is used for detecting the medium scale target; the third scale 52 x 52 profile contains denser squares for detecting small scale targets.
In some embodiments, the loss function of the reference model may include a target classification loss and a target position offset loss. Further, the target classification loss may be weighted differently from the target position offset loss in order to better minimize the value of the loss function (the loss value).
For example, the target classification loss (i.e., the class loss) may be the classical class cross-entropy loss, as follows:

$$L_{cls} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{C} y_{i,k}\,\log y'_{i,k}, \qquad y'_{i,k} = f(\chi_i)_k$$

wherein L_cls represents the target classification loss, N is the number of target candidate boxes, C is the number of classes, y_{i,k} indicates whether the i-th target candidate box belongs to the k-th class (i.e., the object class), y'_{i,k} is the class score of the i-th target candidate box for the k-th class, χ_i is the image region corresponding to the i-th candidate box, and f represents the mapping from the image region to the class scores. In some embodiments, if it is intended to distinguish the virtual game character from the background, the number of categories C may be set to 2, i.e., the first category is background and the second category is character.
For example, the target position offset loss (i.e., the loss of position fitting) may be the classical L1 loss, used to optimize the detection of the position of the target box, as follows:

$$L_{loc} = \frac{1}{N}\sum_{i=1}^{N} \left\| f_{loc}(\chi_i) - (g_i - b_i) \right\|_1$$

wherein L_loc represents the target position offset loss, g_i denotes the position information (e.g., including x-coordinate, y-coordinate, width, and height) of the i-th target box, χ_i denotes the image region corresponding to the i-th candidate box, b_i denotes the position (e.g., including x-coordinate, y-coordinate, width, and height) of the candidate box, and f_loc(χ_i) denotes the position offset predicted from the image region. In the embodiments of the present disclosure, in the above manner, the target classification loss and the target position offset loss are used together to perform class training and positioning training on the detection network in the reference model, thereby improving the robustness of the model.
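As a concrete illustration of the two loss terms defined above, the following sketch computes a cross-entropy classification loss and an L1 position-offset loss and combines them with different weights. PyTorch is used purely for illustration; the framework, the tensor shapes and the weight values w_cls and w_loc are assumptions, not specified by the disclosure.

```python
import torch
import torch.nn.functional as F

def classification_loss(class_scores, class_targets):
    """L_cls: class_scores has shape (N, C) (scores f(chi_i)), class_targets has shape (N,) (class indices)."""
    return F.cross_entropy(class_scores, class_targets)

def position_offset_loss(predicted_offsets, candidate_boxes, ground_truth_boxes):
    """L_loc: L1 loss between the predicted offsets and the true offsets g_i - b_i (rows are x, y, w, h)."""
    target_offsets = ground_truth_boxes - candidate_boxes
    return F.l1_loss(predicted_offsets, target_offsets)

def reference_model_loss(class_scores, class_targets, predicted_offsets,
                         candidate_boxes, ground_truth_boxes, w_cls=1.0, w_loc=1.0):
    """Weighted sum of the two terms; different weights may be assigned to the two losses."""
    return (w_cls * classification_loss(class_scores, class_targets)
            + w_loc * position_offset_loss(predicted_offsets, candidate_boxes, ground_truth_boxes))
```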
It should be understood that in embodiments of the present disclosure, the loss function of the reference model may also include other kinds of losses, such as recognition losses. In addition, the target classification loss may further include a focal loss, etc.; the target position offset loss may further include an Intersection over Union (IOU) loss or the like. The target position offset loss may also take the form of an L2 loss function, a smooth L1 loss function, or the like. The present disclosure is not intended to be exhaustive herein.
In some embodiments, the loss function is iterated by passing the gradient backwards such that the loss value of the loss function is gradually reduced. For example, the reference model may be trained according to a pre-labeled target box and an optional target class, and the training may be stopped when the number of iterations reaches a threshold or a loss value of the loss function is lower than a threshold, so that the training of the reference model may be completed. It should be understood that the model iteration may be performed in a backward-propagation (also called Back propagation) manner using a sigmoid function or the like, and may be performed in any other manner as long as the loss value of the loss function is gradually reduced. The present disclosure is not intended to be exhaustive herein. After the training of the reference model is finished, the features of the three scales can be input into the trained model, and corresponding detection results are obtained through the convolutional layers respectively. For example, in the case where feature spectra of three scales of 13 × 13, 26 × 26, and 52 × 52 are adopted, and 3 detection results are output per grid cell, a total of (52 × 52 + 26 × 26 + 13 × 13) × 3 = 10647 detection results are obtained from the reference model. And then, the deviation value of the position predicted by the model is utilized, and the final target detection result can be obtained through post-processing based on the position of the candidate frame.
At step 204, the target detection model training server 110 may construct a target detection model that contains a smaller number of nodes than the reference model. It should be understood that the number of nodes included in the model is used herein to represent the scale of the model, however, other metrics may be used to represent the scale of the model, such as the complexity of the model, the number of model parameters, the network scale of the model, the target detection speed of the model, and so on. That is, it can be said that the complexity of the target detection model is lower than that of the reference model, the number of model parameters of the target detection model is smaller than that of the reference model, the network scale of the target detection model is smaller than that of the reference model, and the target detection speed of the target detection model is faster than that of the reference model. Specifically, in some embodiments, the target detection model contains a smaller number of convolutional layers than the reference model; and/or the number of convolution channels contained by the target detection model is less than the number of convolution channels contained by the reference model, so that the number of nodes contained by the target detection model is less than the number of nodes contained by the reference model.
Fig. 4B schematically illustrates a specific example of an object detection model according to an embodiment of the present disclosure. The target detection model extracts the convolution characteristic spectrum in the image through 5 convolution layers. Compared with the darknet53 model, the target detection model reduces the number of channels and the number of convolutional layers, thereby reducing the number of nodes contained in the target detection model, reducing the computational complexity, being lighter and more suitable for on-line deployment.
For example, in the object detection model of fig. 4B, the convolution kernel size of each of the 5 convolutional layers is 4, i.e., a convolution kernel of 4 × 4 pixels can be used. The step size of the convolution kernel shift during convolution is 2, i.e. the convolution kernel shifts 2 pixels at a time when performing convolution. The outputs of the 5 convolutional layers are 120, 240, 128, 256, 512 convolutional channels, respectively. The outputs of the last three convolutional layers may be used to detect the target. In some embodiments, a set of image samples containing three scales, 13 × 13, 26 × 26, and 52 × 52, of 416 × 416 pixels in size is also input for the object detection model. For example, after the features of three scales are input into the target detection module, corresponding detection results can be obtained through the convolution layer respectively. It should be understood that various forms of object detection models may be constructed as long as they contain a smaller number of nodes than the reference model, and the present disclosure is not exhaustive.
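A hedged PyTorch sketch of the lightweight backbone described above is given below: five convolutional layers with kernel size 4, stride 2 and output channels 120, 240, 128, 256 and 512, where the outputs of the last three layers are used for detection. The padding and activation choices are assumptions the text does not specify; with a padding of 1 and a 416 × 416 input, the last three feature maps come out at 52 × 52, 26 × 26 and 13 × 13, matching the three scales.

```python
import torch
import torch.nn as nn

class LightweightBackbone(nn.Module):
    def __init__(self, in_channels=3):
        super().__init__()
        channels = [in_channels, 120, 240, 128, 256, 512]
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels[i], channels[i + 1], kernel_size=4, stride=2, padding=1),
                nn.LeakyReLU(0.1, inplace=True),
            )
            for i in range(5)
        ])

    def forward(self, x):
        features = []
        for layer in self.layers:
            x = layer(x)
            features.append(x)
        # The outputs of the last three convolutional layers are used to detect targets.
        return features[-3:]

# With a 416 x 416 input the three returned feature maps are 52 x 52, 26 x 26 and 13 x 13.
feature_maps = LightweightBackbone()(torch.randn(1, 3, 416, 416))
```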
At step 205, the target detection model training server 110 may train the target detection model using the annotated set of image samples with reference to the attention features of the trained reference model. In some embodiments, the knowledge of the trained reference model about attention may be transferred to the constructed target detection model, thereby improving the detection capability of the target detection model, whereby higher detection accuracy may be achieved using a detection model with a smaller number of nodes. The process of transferring the attention-related knowledge of the reference model into the target detection model may be referred to as "attention-distillation". In some embodiments, the method 500 illustrated in FIG. 5 may be employed to train a target detection model with reference to a trained reference model.
At step 501, the target detection model training server 110 may calculate attention features of the trained reference models. In some embodiments, calculating the attention feature of the trained reference model further comprises: accumulating absolute values of convolution features at the same position in the convolution feature spectrum of the trained reference model; and normalizing the accumulated result.
For example, for a feature spectrum with a scale of 52 × 52 and a channel number of 128, the size of the output feature spectrum is 52 × 52 × 128. I.e. both width and height contain 52 lattices. By adding up the absolute values of the convolution features of all channels located at the same position (e.g., the same grid), a signature spectrum with a size of 52 × 52 × 1 can be obtained. The reason for this accumulation is that if the saliency of an object at a location is high (which means that there is a high probability that the object is at that location, i.e. the attention characteristics of the object are large), the value of the corresponding profile should be large. By performing the accumulation operation to obtain the attention feature, it can be known which regions (e.g. grids) in the image have higher significance.
In some embodiments, the accumulated features may also be normalized, thereby resulting in normalized attention features for subsequent processing. During the normalization process, the accumulated feature value may be divided by the sum of the feature spectra, and expressed as follows:
$$f_n' = \frac{f_n}{\sum_{m=1}^{N} f_m + \varepsilon}$$

wherein f_n' is the normalized feature, f_n is the accumulated feature before normalization, N is the number of accumulated features (e.g., 52 × 52), and ε (eps) is a small value that prevents the denominator from being zero, which can be set to, for example, 10⁻⁶.
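A minimal sketch of the attention-feature computation described above follows: the absolute values of the convolution features of all channels at the same position are accumulated, and the result is normalized by its sum plus a small eps. PyTorch is used for illustration only.

```python
import torch

def attention_feature(conv_feature_map, eps=1e-6):
    """conv_feature_map has shape (C, H, W), e.g. (128, 52, 52); the result has shape (H, W)."""
    accumulated = conv_feature_map.abs().sum(dim=0)   # f_n: per-position sum of absolute values over channels
    return accumulated / (accumulated.sum() + eps)    # f_n' = f_n / (sum_m f_m + eps)

# The result no longer depends on the channel count, so reference and target models may differ in channels.
attn = attention_feature(torch.randn(128, 52, 52))    # shape (52, 52)
```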
At step 502, the target detection model training server 110 may compute the attention features of the target detection model in a manner similar to step 501. In some embodiments, the initial value of the attention feature at each position of the target detection model may be set to 0. It should be appreciated that steps 502-505 are performed iteratively (in a loop) during the training of the object detection model.
By extracting the attention feature in the above manner, it is possible to normalize feature spectra having different numbers of channels (e.g., 120, 240, 128, 256, 512 channels of the target detection model shown in fig. 4B) to the same size. Therefore, the limitation of the number of channels is eliminated, so that the number of the channels of the constructed target detection model can be different from the number of the channels of the reference model, and the target detection model with less nodes can be constructed. Furthermore, the attention feature can indicate which locations in the image are more important for the detected object. It will be appreciated that where the reference model employs three scales of convolved feature spectra, the attention features may be extracted separately from the three scales of convolved feature spectra.
At step 503, the target detection model training server 110 may calculate the attention loss of the target detection model. In some embodiments, the attention loss of the object detection model may be calculated based on Euclidean distances (Euclidean distances) between the attention features of the reference model and the attention features of the object detection model. For example, in the case of a convolved feature spectrum of three scales, the following equation may be used:
$$L_{at} = \left\| x_1 - x_1' \right\|_2 + \left\| x_2 - x_2' \right\|_2 + \left\| x_3 - x_3' \right\|_2$$

wherein L_at denotes the attention loss, x_1 is the attention feature of the first scale output by the reference model, x_1' is the attention feature of the first scale output by the target detection model, x_2 is the attention feature of the second scale output by the reference model, x_2' is the attention feature of the second scale output by the target detection model, x_3 is the attention feature of the third scale output by the reference model, and x_3' is the attention feature of the third scale output by the target detection model. It will be appreciated that the attention loss L_at illustrated above includes three terms because the convolved feature spectra have three scales; where convolved feature spectra of more or fewer scales are used, the attention loss calculation can be modified accordingly by increasing or decreasing the number of terms. It should also be understood that the Euclidean distance is used here for illustrative purposes only; in other embodiments, the attention loss may also be calculated using other measures, such as the Manhattan distance, Chebyshev distance, Minkowski distance, Mahalanobis distance, cosine distance, Hamming distance, or Jaccard distance.
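The attention loss above can be sketched as follows, summing the Euclidean distance between the reference-model and target-model attention features over the three scales; flattening each attention map into a vector before taking the distance is an assumption made for illustration.

```python
import torch

def attention_loss(reference_attention, target_attention):
    """Both arguments are lists of three attention maps, one per scale (e.g. 13x13, 26x26, 52x52)."""
    loss = torch.zeros(())
    for x_ref, x_det in zip(reference_attention, target_attention):
        loss = loss + torch.norm(x_ref.flatten() - x_det.flatten(), p=2)
    return loss
```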
At step 504, it may be determined whether the target detection model converges based on the attention loss calculated in step 503. For example, in the first few iterations, the target detection model may be considered to not converge; after several iterations, it may be determined whether the target detection model converges by determining whether the calculated attention loss is gradually reduced with respect to the previously calculated attention loss.
In some embodiments, in addition to this loss of attention, the loss function of the object detection model also includes losses associated with object detection, such as the object classification loss and the object position offset loss mentioned above. The target classification penalty and the target position offset penalty may be calculated in a similar manner as in step 203. In addition, the attention loss, the target classification loss, and the target positional shift loss have different weights. In some embodiments, the weights for the various losses may be set based on experimentation. In this case, it is necessary to determine whether the target detection model converges in consideration of whether the overall loss function value gradually decreases. In the embodiment of the disclosure, in this way, attention loss, target classification loss and target position offset loss are jointly used for performing attention distillation, class training and positioning training on a detection network in a target detection model, so that the robustness of the model is improved.
It should be understood that, as a specific judgment basis, it is possible to subtract the average value of the attention loss or total loss function values calculated several times before from the attention loss or total loss function value calculated this time and judge whether the absolute value of the difference value is smaller than the threshold value. In addition, other ways may be used to make the determination. For example, whether the target detection model converges may be determined by determining whether the attention loss or the overall loss function value is less than a threshold, or simply whether the number of iterations reaches a threshold. The present disclosure is not intended to be exhaustive herein.
In the event that it is determined that the target detection model does not converge, then at step 505, the parameters of the target detection model may be updated based on the attention loss calculated in step 503, and the flow will return to 502 for iteration. In some embodiments, iteration may also be performed in a manner similar to step 203 (e.g., using gradient back-propagation) to minimize the loss of attention of the target detection model.
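One iteration of the distillation training described in steps 502-505 might look like the following sketch, where the attention loss, target classification loss and target position offset loss are combined with different weights and the parameters are updated by a gradient backward pass. The optimizer and the weight values are assumptions, not prescribed by the disclosure.

```python
def distillation_train_step(optimizer, attention_loss_value, classification_loss_value,
                            position_offset_loss_value, w_at=1.0, w_cls=1.0, w_loc=1.0):
    """All three loss arguments are scalar tensors computed for the current batch."""
    total_loss = (w_at * attention_loss_value
                  + w_cls * classification_loss_value
                  + w_loc * position_offset_loss_value)
    optimizer.zero_grad()
    total_loss.backward()   # gradient backward pass
    optimizer.step()        # update the parameters of the target detection model
    return total_loss.item()
```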
In the event that it is determined that the target detection model converges, then at step 506, the training of the target detection model may be ended. This concludes the methods of FIG. 5 and FIG. 2.
By the method for training the target detection model according to the embodiments of the present disclosure, the attention features of a model containing more nodes, such as a YOLO-based target detection model or a cascaded RCNN-based detection model, are transferred to a target detection model containing fewer nodes by means of attention distillation, so that the detection capability of the target detection model can be improved. In addition, the attention distillation of the embodiments of the present disclosure does not require the distilled features to have consistent sizes; compared with a depth estimation method based on knowledge distillation, the number of convolution channels can be reduced and the number of nodes of the target detection model can be further reduced, so that the target detection model is more suitable for applications with higher real-time requirements and for on-line deployment. Moreover, because the number of nodes of the target detection model is reduced, the embodiments of the present disclosure can also accelerate the inference speed of the target detection model and improve its training efficiency.
Fig. 6 schematically shows a flow chart of a method of detecting an object in an image according to an embodiment of the present disclosure. In some embodiments, the object detection method illustrated in FIG. 6 may be performed, for example, on one or more electronic devices 130, so as to detect objects contained in images obtained at the one or more electronic devices 130, for example, images in the applications 132 displayed on the display screens 131 of the one or more electronic devices 130. In some embodiments, steps 601-605 are similar to steps 201-205 of the method 200 shown in FIG. 2; therefore, a detailed description of steps 601-605 is omitted herein. It should be understood that, in the case where the training of the target detection model has already been completed using steps 201-205 in FIG. 2, or a trained target detection model has been obtained in other ways, steps 601-605 in FIG. 6 can be omitted and the trained target detection model can be used directly to detect targets in images.
In some embodiments, at step 606, a trained object detection model may be used to detect objects in the image. For example, inputting game images into a trained object detection model may output the position of game characters in the images. Further, the detected position information of the game character can be input into the gunfight game AI model, so that important basis of action can be provided for the game AI.
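At inference time, using the trained model could look roughly like the sketch below; the model's output format, the preprocessing and the score threshold are assumptions, and the post-processing that turns raw predictions into final character positions is only hinted at.

```python
import torch

def detect_characters(model, frame_tensor, score_threshold=0.5):
    """frame_tensor: a preprocessed game frame of shape (1, 3, 416, 416); returns the kept boxes."""
    model.eval()
    with torch.no_grad():
        boxes, scores = model(frame_tensor)   # assumed output: candidate boxes and their scores
    keep = scores > score_threshold
    return boxes[keep]                        # positions that can be passed on to the game AI model
```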
By the target detection method, a more accurate target detection result can be obtained by using the target detection model with less nodes, and the method is more suitable for application with higher requirements on real-time degree and more suitable for on-line deployment.
The above examples of gun battle type games are only used for understanding the present solution, and it should be understood that embodiments of the present disclosure may also include, but are not limited to, applications in unmanned systems, unmanned aerial vehicle flight systems, security systems, and the like. Fig. 7 schematically shows an example of a further application scenario according to an embodiment of the present disclosure. As shown in fig. 7, the embodiment of the present disclosure may be applied in a scenario of automatic driving.
As shown in fig. 7, in the scenario of automatic driving, targets may be labeled into more categories. In this case, the labeling at step 202 (or step 602) needs to be changed. For example, targets may be labeled as four categories: background, static object, slow-moving object and fast-moving object. For example, in terms of their impact on autonomous driving, the house and trees shown in fig. 7, which are outside the road, may be labeled as background; the zebra crossing, the driving lane and traffic lights (not shown in fig. 7), which serve as a control basis for automatic driving, are labeled as static objects in the road; the pedestrian entering or about to enter the road shown in fig. 7 is labeled as a slow-moving object, since the moving speed of a pedestrian is slow; and the other vehicles in the road shown in fig. 7 are labeled as fast-moving objects.
In some embodiments, the method of fig. 2 may be employed after the annotation is completed for training of a target detection model using the annotated set of image samples or for detecting targets in images obtained during automated driving using the method of fig. 6. In some embodiments, at steps 203 and 205 (or steps 603 and 605), the target classification loss may instead be calculated according to the four categories described above, respectively.
In some embodiments, after various targets (stationary objects, slow-moving objects, fast-moving objects) in an image obtained in an automatic driving process are detected by a target detection model, driving can be performed according to the indication of stationary road signs or traffic lights, and moving objects in a road or possibly entering the road need to be avoided. When employing avoidance strategies, it is also necessary to specify different strategies taking into account the speed of the moving object, e.g. slow (quasi-stationary) or fast. In addition, detected road information (e.g., "zebra stripes in front of 50 m") and other information (e.g., current power and speed information) may also be displayed on the display screen 131.
In the automatic driving scene of the embodiment of the disclosure, the target in the image can be more accurately detected by the target detection model with a small number of nodes, so that the target detection time can be reduced, and the target detection can be faster and more accurately performed, thereby being beneficial to making faster judgment and avoidance by an automatic driving system and improving the safety of automatic driving.
FIG. 8A schematically illustrates a block diagram of an apparatus 800 for training an object detection model, in which various methods of training an object detection model described herein may be implemented, according to an embodiment of the disclosure. Fig. 8B schematically illustrates a block diagram of an apparatus 810 for detecting an object in an image, in which various methods of detecting an object described herein may be implemented, according to an embodiment of the present disclosure. As shown in fig. 8A, the apparatus 800 for training a target detection model may include an image sample acquisition module 801, a target labeling module 802, and a model training module 803. As shown in fig. 8B, the means 810 for detecting an object in an image may comprise an object detection module 804. In addition, the apparatus 810 for detecting objects in an image may further optionally include an image sample acquisition module 801, an object labeling module 802, and a model training module 803, so that the apparatus 810 can train an object detection model as well as detect various objects in an image using the trained object detection model.
The image sample acquisition module 801 is configured to acquire a set of image samples. In some embodiments, the image sample acquisition module 801 is further configured to acquire a set of image samples in the manner described in step 201 of the method 200, for example: obtaining a plurality of image samples containing a target from an image set; deleting image samples of which the proportion of the target area to the image sample area is smaller than a threshold value from the acquired plurality of image samples; and deleting image samples with similarity higher than a threshold value from the acquired plurality of image samples.
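A minimal sketch of these two filtering rules is given below; the data layout, both thresholds and the crude similarity measure are assumptions made for illustration, not the disclosed implementation.

```python
import numpy as np

def similarity(img_a, img_b):
    """Crude assumed similarity measure: 1 minus the mean absolute pixel
    difference (images of different shapes are treated as dissimilar)."""
    a = np.asarray(img_a, dtype=np.float32)
    b = np.asarray(img_b, dtype=np.float32)
    if a.shape != b.shape:
        return 0.0
    return 1.0 - float(np.mean(np.abs(a - b))) / 255.0

def filter_image_samples(samples, area_ratio_thresh=0.01, sim_thresh=0.9):
    """Drop samples whose target region is too small relative to the image,
    then drop samples that are near-duplicates of samples already kept."""
    kept = []
    for s in samples:  # s: {"image": HxWxC array, "boxes": [{"w": ..., "h": ...}, ...]}
        img_area = s["image"].shape[0] * s["image"].shape[1]
        target_area = sum(b["w"] * b["h"] for b in s["boxes"])
        if target_area / img_area < area_ratio_thresh:
            continue
        if any(similarity(s["image"], k["image"]) > sim_thresh for k in kept):
            continue
        kept.append(s)
    return kept
```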
The target annotation module 802 is configured to annotate one or more targets in an image sample of the set of image samples. In some embodiments, the target annotation module 802 is further configured to annotate the one or more targets in the image sample in the manner described in step 202 of the method 200, for example: one or more target boxes are set in the image sample for labeling an x-coordinate and a y-coordinate of a center position of the one or more targets, and a width and a height of the one or more targets. In some embodiments, the target annotation module 802 may be further configured to annotate the one or more targets in the image sample by labeling the categories of the one or more targets.
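One possible annotation record is sketched below with assumed field names; it only illustrates the labeled quantities (center coordinates, width, height and, optionally, a category).

```python
def annotate_target(cx, cy, width, height, class_id=None):
    """One labeled target box: x/y of the center position, width and height,
    plus an optional category index (field names are illustrative)."""
    box = {"cx": cx, "cy": cy, "w": width, "h": height}
    if class_id is not None:
        box["class"] = class_id
    return box

# e.g. a pedestrian box centered at (320, 240), 40 px wide and 90 px tall
pedestrian_box = annotate_target(320, 240, 40, 90, class_id=2)
```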
The model training module 803 is configured to train a reference model using the annotated set of image samples; construct a target detection model, wherein the number of nodes contained in the target detection model is less than the number of nodes contained in the reference model; and train the target detection model using the annotated set of image samples with reference to the attention features of the trained reference model. In some embodiments, the loss function of the target detection model comprises a loss of attention between the attention features of the trained reference model and the attention features of the target detection model, wherein the model training module is further configured to train the target detection model by: calculating the attention features of the trained reference model; and iteratively performing the following steps to gradually reduce the loss function of the target detection model: calculating the attention features of the target detection model; calculating the attention loss of the target detection model based on Euclidean distances between the attention features of the reference model and the attention features of the target detection model; and updating parameters of the target detection model based on the calculated attention loss of the target detection model.
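The following is a minimal PyTorch-style sketch of the attention loss described above: the attention feature of a convolution feature spectrum is obtained by accumulating the absolute values of co-located convolution features across channels and normalizing, and the attention loss is the Euclidean distance between the reference model's and the target detection model's attention features, summed over the feature scales. The framework choice, the tensor layout and the assumption that the two models' feature maps share spatial sizes are choices made for this sketch, not part of the original disclosure.

```python
import torch
import torch.nn.functional as F

def attention_feature(conv_feature_map):
    """Accumulate absolute values across channels at each spatial location,
    then L2-normalize the result per sample."""
    # conv_feature_map: (batch, channels, H, W)
    att = conv_feature_map.abs().sum(dim=1)   # (batch, H, W)
    att = att.flatten(start_dim=1)            # (batch, H*W)
    return F.normalize(att, p=2, dim=1)

def attention_loss(ref_feature_maps, det_feature_maps):
    """Euclidean distance between reference-model and detection-model attention
    features, summed over the (e.g. three) convolution feature scales."""
    loss = 0.0
    for ref_fm, det_fm in zip(ref_feature_maps, det_feature_maps):
        diff = attention_feature(ref_fm) - attention_feature(det_fm)
        loss = loss + torch.norm(diff, p=2, dim=1).mean()
    return loss
```

In training, this attention loss would be added, with its own weight, to the target classification loss and the target position offset loss before back-propagating and updating the parameters of the target detection model.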
The target detection module 804 is configured to use the trained target detection model to detect targets in the image.
Although specific functionality is discussed above with reference to particular modules, it should be noted that the functionality of the various modules discussed herein may be divided among multiple modules, and/or at least some of the functionality of multiple modules may be combined into a single module. Additionally, a particular module performing an action discussed herein includes that module performing the action itself, or that module invoking or otherwise accessing another component or module that performs the action (or that performs the action in conjunction with the particular module).
The various modules described above with respect to fig. 8A and 8B may be implemented in hardware, or in hardware combined with software and/or firmware. For example, the modules may be implemented as computer program code/instructions configured to be executed in one or more processors and stored in a computer-readable storage medium. Alternatively, the modules may be implemented as hardware logic/circuitry. For example, in some embodiments, one or more of the image sample acquisition module 801, the target annotation module 802, the model training module 803, and the target detection module 804 may be implemented together in a system on a chip (SoC). The SoC may include an integrated circuit chip including one or more components of a processor (e.g., a Central Processing Unit (CPU), microcontroller, microprocessor, Digital Signal Processor (DSP), etc.), memory, one or more communication interfaces, and/or other circuitry, and may optionally execute received program code and/or include embedded firmware to perform functions. The features of the techniques described herein are platform-independent, meaning that the techniques may be implemented on a variety of computing platforms having a variety of processors.
Fig. 9 schematically illustrates a block diagram of a computing device 900 in accordance with an embodiment of the disclosure. The computing device 900 is representative of one or more of the object detection model training server 110 and the electronic device 130 included in the object detection system 100 of FIG. 1.
Computing device 900 may be a variety of different types of devices, such as a server computer, a device associated with a client (e.g., a client device), a system on a chip, and/or any other suitable computing device or computing system.
Computing device 900 may include at least one processor 902, memory 904, communication interface(s) 906, display device 908, other input/output (I/O) devices 910, and one or more mass storage devices 912, which may be connected to communicate with each other, such as by a system bus 914 or other appropriate means.
The processor 902 may be a single processing unit or a plurality of processing units, all of which may include single or multiple computing units or multiple cores. The processor 902 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitry, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor 902 may be configured to retrieve and execute computer readable instructions, such as program code of an operating system 916, program code of an application 918, program code of other programs 920, and so forth, stored in the memory 904, the mass storage 912, or other computer readable media to implement the method 200 of training an object detection model or the object detection method 600 provided by embodiments of the present disclosure.
Memory 904 and mass storage device 912 are examples of computer storage media for storing instructions that are executed by processor 902 to perform the various functions described above. By way of example, the memory 904 may generally include both volatile and nonvolatile memory (e.g., RAM, ROM, and the like). In addition, the mass storage device 912 may generally include a hard disk drive, a solid state drive, removable media (including external and removable drives), memory cards, flash memory, floppy disks, optical disks (e.g., CDs, DVDs), storage arrays, network storage, storage area networks, and so forth. Memory 904 and mass storage device 912 may both be referred to herein collectively as memory or computer storage media, and may be non-transitory media capable of storing computer-readable, processor-executable program instructions as computer program code that may be executed by processor 902 as a particular machine configured to implement the operations and functions described in the examples herein.
A number of program modules may be stored on the mass storage device 912. These programs include an operating system 916, one or more application programs 918, other programs 920, and program data 922, which can be loaded into memory 904 for execution. Examples of such applications or program modules may include, for instance, computer program logic (e.g., computer program code or instructions) for implementing the following components/functions: an image sample acquisition module 801, an object labeling module 802, a model training module 803, an object detection module 804, and/or further embodiments described herein. In some embodiments, these program modules may be distributed at different physical locations, for example, on the object detection model training server 110 or the electronic device 130 shown in FIG. 1, to implement the respective functions.
Although illustrated in fig. 9 as being stored in memory 904 of computing device 900, modules 916, 918, 920, and 922, or portions thereof, may be implemented using any form of computer-readable media that is accessible by computing device 900. As used herein, "computer-readable media" may include one or more types of computer-readable media, which may include, for example, computer storage media and/or communication media.
Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information for access by a computing device.
In contrast, communication media may embody computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism. Computer storage media, as defined herein, does not include communication media.
Computing device 900 may also include one or more communication interfaces 906 for exchanging data with other devices, such as over a network, direct connection, or the like. Communication interface 906 may facilitate communications within a variety of networks and protocol types, including wired networks (e.g., LAN, cable, etc.) and wireless networks (e.g., WLAN, cellular, satellite, etc.), the Internet, and so forth. Communication interface 906 may also provide for communication with external storage devices (not shown), such as in a storage array, network storage, storage area network, or the like.
In some examples, a display device 908, such as a monitor, may be included for displaying information and images. Other I/O devices 910 may be devices that receive various inputs from a user and provide various outputs to the user, and may include touch input devices, gesture input devices, cameras, keyboards, remote controls, mice, printers, audio input/output devices, and so forth.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the system, the apparatus, and the modules described above may refer to corresponding processes in the foregoing method embodiments, and are not described herein again.
Variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed subject matter, from a study of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the indefinite article "a" or "an" does not exclude a plurality; the terms "first", "second", "third" and "fourth" are used merely to distinguish one element or step from another, and do not indicate the order of the elements or steps. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

Claims (15)

1. A method of training a target detection model, comprising:
acquiring an image sample set;
labeling one or more targets in an image sample in the set of image samples;
training a reference model using the annotated set of image samples;
constructing a target detection model, wherein the number of nodes contained in the target detection model is less than the number of nodes contained in the reference model; and
training the target detection model using the annotated set of image samples with reference to attention features of the trained reference model.
2. The method of claim 1, wherein obtaining the set of image samples further comprises:
obtaining a plurality of image samples containing a target from an image set;
deleting image samples of which the proportion of the target area to the image sample area is smaller than a threshold value from the acquired plurality of image samples; and
deleting image samples with a similarity higher than a threshold from the acquired plurality of image samples.
3. The method of claim 1, wherein labeling the one or more targets in the image sample comprises:
setting one or more target boxes in the image sample for labeling an x-coordinate and a y-coordinate of a center position of the one or more targets, and a width and a height of the one or more targets;
labeling the one or more targets in the image sample further comprises: labeling the category of the one or more targets.
4. The method of claim 1, wherein training the reference model further comprises:
scaling image samples in the annotated set of image samples such that they are consistent in size;
setting a loss function of the reference model; and
iteratively updating parameters of the reference model in a manner that gradually decreases a penalty function of the reference model.
5. The method of claim 4,
the reference model predicts the category and the position of a target through convolution feature spectrums of three different scales;
the loss function includes a target classification loss and a target position offset loss, and the target classification loss and the target position offset loss have different weights;
the iteration is performed by passing the gradient backwards.
6. The method of claim 1,
the number of convolutional layers contained in the target detection model is smaller than the number of convolutional layers contained in the reference model; and/or
the number of convolution channels contained in the target detection model is smaller than the number of convolution channels contained in the reference model.
7. The method of any one of claims 1-6,
the loss function of the object detection model comprises a loss of attention between the attention features of the trained reference model and the attention features of the object detection model,
wherein training the target detection model with reference to the trained reference model further comprises:
calculating attention features of the trained reference model;
iteratively performing the following steps to gradually reduce a loss function of the target detection model:
calculating attention features of the object detection model;
calculating a loss of attention of the object detection model based on Euclidean distances between attention features of the reference model and attention features of the object detection model; and
updating parameters of the object detection model based on the calculated attention loss of the object detection model.
8. The method of claim 7, wherein computing attention features of the trained reference model further comprises:
accumulating absolute values of co-located convolution features in the trained convolution feature spectrum of the reference model;
and normalizing the accumulated result.
9. The method of claim 8,
the convolution feature spectrum comprises convolution feature spectrums of three different scales;
the attention feature of the reference model and the attention feature of the target detection model are attention features calculated through convolution feature spectrums of three different scales;
the loss function of the object detection model further includes an object classification loss and an object position offset loss, and the attention loss, the object classification loss and the object position offset loss have different weights.
10. A method of detecting an object in an image, comprising:
use of an object detection model trained by the method of any one of claims 1-9 for detecting objects in an image.
11. An apparatus for training a target detection model, comprising:
an image sample acquisition module configured to acquire a set of image samples;
a target annotation module configured to annotate one or more targets in an image sample of the set of image samples;
a model training module configured to:
training a reference model using the annotated set of image samples;
constructing a target detection model, wherein the number of nodes contained in the target detection model is less than the number of nodes contained in the reference model; and
training the target detection model using the annotated set of image samples with reference to attention features of the trained reference model.
12. The apparatus of claim 11,
the loss function of the object detection model comprises a loss of attention between the attention features of the trained reference model and the attention features of the object detection model,
wherein the model training module is further configured to train the target detection model with reference to the trained reference model by:
calculating attention features of the trained reference model;
iteratively performing the following steps to gradually reduce a loss function of the target detection model:
calculating attention features of the object detection model;
calculating a loss of attention of the object detection model based on Euclidean distances between attention features of the reference model and attention features of the object detection model; and
updating parameters of the object detection model based on the calculated attention loss of the object detection model.
13. An apparatus for detecting an object in an image, comprising:
an object detection module configured to use an object detection model trained by the method of any one of claims 1-9 for detecting objects in an image.
14. A computing device, comprising:
a processor; and
a memory having instructions stored thereon that, when executed on the processor, cause the processor to perform the method of any of claims 1-9 or perform the method of claim 10.
15. One or more computer-readable storage media having instructions stored thereon, which when executed on one or more processors cause the one or more processors to perform the method of any one of claims 1-9 or perform the method of claim 10.
CN202010843340.6A 2020-08-20 2020-08-20 Method and device for training target detection model and target detection method and device Pending CN112036457A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010843340.6A CN112036457A (en) 2020-08-20 2020-08-20 Method and device for training target detection model and target detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010843340.6A CN112036457A (en) 2020-08-20 2020-08-20 Method and device for training target detection model and target detection method and device

Publications (1)

Publication Number Publication Date
CN112036457A true CN112036457A (en) 2020-12-04

Family

ID=73579915

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010843340.6A Pending CN112036457A (en) 2020-08-20 2020-08-20 Method and device for training target detection model and target detection method and device

Country Status (1)

Country Link
CN (1) CN112036457A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112651326A (en) * 2020-12-22 2021-04-13 济南大学 Driver hand detection method and system based on deep learning
CN113138916A (en) * 2021-04-06 2021-07-20 青岛以萨数据技术有限公司 Automatic testing method and system for picture structuring algorithm based on labeled sample
CN113138916B (en) * 2021-04-06 2024-04-30 青岛以萨数据技术有限公司 Automatic testing method and system for picture structuring algorithm based on labeling sample
CN113657225A (en) * 2021-08-05 2021-11-16 武汉工程大学 Target detection method
CN113657225B (en) * 2021-08-05 2023-09-26 武汉工程大学 Target detection method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination