CN111640103A - Image detection method, device, equipment and storage medium - Google Patents

Image detection method, device, equipment and storage medium

Info

Publication number
CN111640103A
Authority
CN
China
Prior art keywords
attribute
target
image
sample image
detected
Prior art date
Legal status
Granted
Application number
CN202010478424.4A
Other languages
Chinese (zh)
Other versions
CN111640103B (en)
Inventor
王康康 (Wang Kangkang)
Current Assignee
Beijing Quanwang Zhishu Technology Co., Ltd.
Original Assignee
Beijing Baidu Netcom Science and Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co., Ltd.
Priority to CN202010478424.4A
Publication of CN111640103A
Application granted
Publication of CN111640103B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS › G06: COMPUTING; CALCULATING OR COUNTING › G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis › G06T 7/0002 Inspection of images, e.g. flaw detection
    • G06T 7/00 Image analysis › G06T 7/70 Determining position or orientation of objects or cameras › G06T 7/73 using feature-based methods
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement › G06T 2207/10 Image acquisition modality › G06T 2207/10004 Still image; photographic image
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement › G06T 2207/20 Special algorithmic details › G06T 2207/20081 Training; learning
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement › G06T 2207/20 Special algorithmic details › G06T 2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the present application disclose an image detection method, apparatus, device and storage medium, relating to the technical fields of artificial intelligence, deep learning and image processing. One embodiment of the method comprises: inputting an image to be detected into a pre-trained multi-task detection model, wherein the multi-task detection model comprises a position detection branch network and an attribute prediction branch network; processing the image to be detected with the position detection branch network to obtain the positions of targets in the image; processing the image to be detected with the attribute prediction branch network to obtain the attributes of the targets in the image; and outputting, in correspondence, the positions and attributes of at least some of the targets in the image. Because position detection and attribute prediction are performed simultaneously by a multi-task detection model containing both a position detection branch network and an attribute prediction branch network, and because the attribute prediction branch network predicts the attributes of all targets at once, this embodiment greatly reduces the time consumed by multi-task detection.

Description

Image detection method, device, equipment and storage medium
Technical Field
The embodiments of the present application relate to the field of computer technology, in particular to the technical fields of artificial intelligence, deep learning and image processing, and specifically to an image detection method, device, equipment and storage medium.
Background
Target detection is a fundamental task in deep learning: a target detection model can locate the bounding boxes of multiple targets in an image. With a general target detection model, when an image is input, only the positions of the targets in the image are output. When multi-task detection is required, multiple models are therefore typically needed. For example, after the target detection model detects the bounding boxes of multiple targets in an image, the regions inside those bounding boxes are extracted from the original image in a predetermined format, and after a small amount of format conversion, the extracted regions are input one by one to an attribute prediction model to obtain the attribute values of the targets.
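For illustration only, the following is a minimal PyTorch sketch of the two-stage pipeline described above; the `detector` and `attribute_model` callables and the crop size are hypothetical stand-ins, not models from this application. The per-target loop is what makes prediction time grow linearly with the number of targets.

```python
import torch
import torch.nn.functional as F

def two_stage_pipeline(image, detector, attribute_model, crop_size=(64, 64)):
    """image: (3, H, W) tensor; detector returns integer (x1, y1, x2, y2) boxes."""
    boxes = detector(image)
    attributes = []
    for (x1, y1, x2, y2) in boxes:
        crop = image[:, y1:y2, x1:x2].unsqueeze(0)   # extract the target frame
        crop = F.interpolate(crop, size=crop_size)   # convert to the predetermined format
        attributes.append(attribute_model(crop))     # one forward pass per target
    return boxes, attributes                          # cost grows linearly with len(boxes)
```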
Disclosure of Invention
The embodiment of the application provides an image detection method, an image detection device, image detection equipment and a storage medium.
In a first aspect, an embodiment of the present application provides an image detection method, including: inputting an image to be detected into a pre-trained multi-task detection model, wherein the multi-task detection model comprises a position detection branch network and an attribute prediction branch network; processing the image to be detected by using a position detection branch network to obtain the position of a target in the image to be detected; processing the image to be detected by using an attribute prediction branch network to obtain the attribute of the target in the image to be detected; and correspondingly outputting the position and the attribute of at least part of the target in the image to be detected.
In a second aspect, an embodiment of the present application provides an image detection apparatus, including: the image input module is configured to input an image to be detected to a pre-trained multi-task detection model, wherein the multi-task detection model comprises a position detection branch network and an attribute prediction branch network; the position detection module is configured to process the image to be detected by utilizing the position detection branch network to obtain the position of a target in the image to be detected; the attribute prediction module is configured to process the image to be detected by using the attribute prediction branch network to obtain the attribute of the target in the image to be detected; and the corresponding output module is configured to correspondingly output the position and the attribute of at least part of the target in the image to be detected.
In a third aspect, an embodiment of the present application provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described in any one of the implementations of the first aspect.
In a fourth aspect, embodiments of the present application propose a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method as described in any one of the implementations of the first aspect.
According to the image detection method, device, equipment and storage medium provided by the embodiments of the present application, an image to be detected is first input to a pre-trained multi-task detection model; the image is then processed by the position detection branch network to obtain the positions of the targets in the image while, in parallel, the image is processed by the attribute prediction branch network to obtain the attributes of those targets; finally, the positions and attributes of at least some of the targets are output in correspondence. Because position detection and attribute prediction are performed simultaneously by a multi-task detection model that contains both a position detection branch network and an attribute prediction branch network, and because the attribute prediction branch network predicts the attributes of all targets at once, the time consumed by multi-task detection is greatly reduced. This avoids the prior approach in which a target detection model is first used on its own to detect the target frames in an image and an attribute prediction model is then used on its own to predict the attributes of the targets in those frames, an approach whose attribute prediction time grows linearly with the number of targets.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings. The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is an exemplary system architecture to which the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of an image detection method according to the present application;
FIG. 3 is a schematic diagram of a network structure of a multitask detection model at the time of prediction;
FIG. 4 is a flow diagram of one embodiment of a multitasking detection model training method according to the present application;
FIG. 5 is a schematic diagram of a network structure of a multi-tasking detection model during training;
FIG. 6 is a schematic block diagram of one embodiment of an image detection apparatus according to the present application;
fig. 7 is a block diagram of an electronic device for implementing the image detection method according to the embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of those embodiments to aid understanding; these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Likewise, descriptions of well-known functions and constructions are omitted from the following description for clarity and conciseness.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 shows an exemplary system architecture 100 to which embodiments of the image detection method or image detection apparatus of the present application may be applied.
As shown in fig. 1, a system architecture 100 may include a terminal device 101, a network 102, and a server 103. Network 102 is the medium used to provide communication links between terminal devices 101 and server 103. Network 102 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
A user may use terminal device 101 to interact with server 103 over network 102 to receive or send messages and the like. Various client applications, such as an image processing application and the like, may be installed on the terminal apparatus 101.
The terminal device 101 may be hardware or software. When the terminal device 101 is hardware, it may be any of various electronic devices with a camera, including but not limited to a smartphone, a tablet computer, a laptop portable computer, a desktop computer, and the like. When the terminal device 101 is software, it can be installed in the electronic devices described above; it may be implemented as multiple pieces of software or software modules, or as a single piece of software or software module. No specific limitation is made here.
The server 103 may be a server that provides various services, such as a background server for image processing applications. The background server of the image processing application may perform processing such as analysis on data of the image to be detected acquired from the terminal device 101, and feed back a processing result (for example, a position and an attribute of at least a part of the target in the image to be detected) to the terminal device 101.
The server 103 may be hardware or software. When the server 103 is hardware, it may be implemented as a distributed server cluster composed of multiple servers, or as a single server. When the server 103 is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. No specific limitation is made here.
It should be noted that the image detection method provided in the embodiment of the present application is generally executed by the server 103, and accordingly, the image detection apparatus is generally disposed in the server 103.
It should be understood that the numbers of terminal devices, networks, and servers in fig. 1 are merely illustrative; there may be any number of each, as required by the implementation. When the image to be detected is already stored on the server 103, the system architecture 100 may omit the terminal device 101 and the network 102. Likewise, when the terminal device 101 itself has detection capability, the image detection method provided in the embodiments of the present application may be executed by the terminal device 101, and the image detection apparatus may accordingly be provided in the terminal device 101; in that case, the system architecture 100 may omit the network 102 and the server 103.
With continued reference to FIG. 2, a flow 200 of one embodiment of an image detection method according to the present application is shown. The image detection method comprises the following steps:
step 201, inputting an image to be detected to a multi-task detection model trained in advance.
In the present embodiment, the execution body of the image detection method (for example, the server 103 shown in fig. 1) may input the image to be detected to a multi-task detection model trained in advance.
In general, a terminal device (e.g., the terminal device 101 shown in fig. 1) can transmit the image to be detected to the execution body. After receiving the image, the execution body inputs it into the multi-task detection model. Alternatively, if the image to be detected is stored locally, the execution body may acquire it locally and input it to the multi-task detection model.
The multi-task detection model may comprise a position detection branch network and an attribute prediction branch network, both of which contain a large number of convolutional layers; typically, the position detection branch network contains more convolutional layers than the attribute prediction branch network. An image is input at the input layer of the multi-task detection model and sent separately to the position detection branch network and the attribute prediction branch network for processing. The position detection branch network is used to detect the positions of targets in the image. The position of a target describes where the target is in the image, and may include, but is not limited to, the coordinates of its target frame, its score, and its index. The attribute prediction branch network is used to predict the attributes of targets in the image. The attribute of a target is a description of abstract characteristics of the target, i.e., properties and relationships the target has, including but not limited to keypoints, shape, and color. In general, different attributes are predicted by different attribute prediction branch networks; when multiple attributes of a target need to be predicted, a corresponding number of attribute prediction branch networks are arranged in the multi-task detection model.
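As a rough illustration of this two-branch structure (and not the network actually disclosed in this application), the model could be sketched in PyTorch as follows; the layer counts, channel widths, and dense per-cell output layout are all assumptions made for the sketch.

```python
import torch
import torch.nn as nn

def conv_stack(in_ch, out_ch, n_layers):
    """Build a simple stack of 3x3 conv + ReLU layers."""
    layers = []
    for _ in range(n_layers):
        layers += [nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU()]
        in_ch = out_ch
    return nn.Sequential(*layers)

class MultiTaskDetector(nn.Module):
    def __init__(self, num_attrs=10):
        super().__init__()
        # The position branch typically contains more convolutional
        # layers than the attribute branch.
        self.position_body = conv_stack(3, 64, n_layers=8)
        self.attribute_body = conv_stack(3, 64, n_layers=4)
        self.coordinate_branch = nn.Conv2d(64, 4, 1)          # target-frame coords per cell
        self.scoring_branch = nn.Conv2d(64, 1, 1)             # target-presence score per cell
        self.attribute_branch = nn.Conv2d(64, num_attrs, 1)   # attribute logits per cell

    def forward(self, image):
        pos_feat = self.position_body(image)     # position features
        attr_feat = self.attribute_body(image)   # attribute features
        return {
            "coords": self.coordinate_branch(pos_feat),
            "scores": torch.sigmoid(self.scoring_branch(pos_feat)),
            "attrs": self.attribute_branch(attr_feat),
        }
```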
Step 202, processing the image to be detected by using the position detection branch network to obtain the position of the target in the image to be detected.
In this embodiment, the execution body may process the image to be detected with the position detection branch network to obtain the positions of the targets in the image.
In general, the image entering at the input layer of the multi-task detection model is further fed into the position detection branch network, which outputs the positions of the targets in the image. In some embodiments, the position detection branch network may include a position detection network body, a coordinate branch, and a scoring branch. The position detection network body extracts the position features of the targets in the image input to it. The coordinate branch predicts, from the position features, the coordinates of the target frames in the image, where a target frame is the bounding box of a target. The scoring branch predicts, from the position features, the score of each target frame; the higher the score, the greater the probability that a target is present in the corresponding target frame.
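Under the dense per-cell layout assumed in the sketch above, the position branch outputs can be flattened into (target frame, score, index) triples; the indexing scheme here is an assumption made for illustration.

```python
import torch

def decode_positions(coords, scores):
    """coords: (4, H, W), scores: (1, H, W) from the position detection branch."""
    h, w = scores.shape[1], scores.shape[2]
    frames = coords.reshape(4, -1).t()   # one (x1, y1, x2, y2) row per spatial cell
    conf = scores.reshape(-1)            # one score per cell
    index = torch.arange(h * w)          # index later ties a frame to its attributes
    return frames, conf, index
```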
And 203, processing the image to be detected by using the attribute prediction branch network to obtain the attribute of the target in the image to be detected.
In this embodiment, the execution body may process the image to be detected with the attribute prediction branch network to obtain the attributes of the targets in the image.
In general, the image entering at the input layer of the multi-task detection model is further fed into the attribute prediction branch network, which outputs the attributes of the targets in the image. In some embodiments, the attribute prediction branch network may include an attribute prediction network body and an attribute branch. The attribute prediction network body extracts the attribute features of the targets in the image input to it. The attribute branch predicts, from the attribute features, the attributes of the targets in the image.
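Continuing the same assumed layout, a single forward pass of the attribute branch yields an attribute vector for every cell at once, which is what allows the attributes of all targets to be predicted in one shot rather than per target.

```python
def decode_attributes(attrs):
    """attrs: (num_attrs, H, W) -> (H*W, num_attrs), one vector per spatial cell."""
    return attrs.reshape(attrs.shape[0], -1).t()
```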
And 204, correspondingly outputting the positions and attributes of at least part of targets in the image to be detected.
In this embodiment, the execution body may output, in correspondence, the positions and attributes of at least some of the targets in the image to be detected.
In general, the execution body may first match the positions and attributes of at least some of the targets in the image one to one, and then output the matched positions and attributes to the terminal device. In some embodiments, the position may include the coordinates, score, and index of the target frame. The execution body may output the positions and attributes of all targets in correspondence: it looks up the attributes of the targets in the image using the index of each target's frame, obtains each target's attribute, and outputs each target's position and attribute in correspondence. Alternatively, the execution body may output the positions and attributes of only some of the targets. In that case, it selects at least some of the targets in the image based on the scores of the target frames, for example selecting only targets whose score exceeds a preset score threshold (e.g., 0.9). It then looks up the attributes of the targets in the image using the index of each selected target's frame, obtains each selected target's attribute, and outputs each selected target's position and attribute in correspondence. Selecting which targets' positions and attributes to output based on the score ensures that the output covers the targets detected with higher accuracy in the image, and reading a target's attribute through the index of its target frame allows the target's position to be matched quickly to its attribute.
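As a minimal sketch of this step, assuming the flattened decoding illustrated earlier, the score-based selection and index-based attribute lookup could look like this; the 0.9 threshold matches the example above, and all names are illustrative.

```python
def select_and_pair(frames, conf, index, attr_table, score_thresh=0.9):
    keep = conf > score_thresh                       # keep only high-confidence targets
    results = []
    for i in index[keep]:
        results.append({
            "frame": frames[i].tolist(),
            "score": conf[i].item(),
            "attributes": attr_table[i].tolist(),    # attribute looked up by index
        })
    return results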
According to the image detection method provided by this embodiment of the present application, an image to be detected is first input to a pre-trained multi-task detection model; the image is then processed by the position detection branch network to obtain the positions of the targets in the image while, in parallel, the image is processed by the attribute prediction branch network to obtain the attributes of those targets; finally, the positions and attributes of at least some of the targets are output in correspondence. Because position detection and attribute prediction are performed simultaneously by a multi-task detection model that contains both a position detection branch network and an attribute prediction branch network, and because the attribute prediction branch network predicts the attributes of all targets at once, the time consumed by multi-task detection is greatly reduced. This avoids the prior approach in which a target detection model is first used on its own to detect the target frames in an image and an attribute prediction model is then used on its own to predict the attributes of the targets in those frames, an approach whose attribute prediction time grows linearly with the number of targets.
For ease of understanding, a schematic diagram of the network structure of the multi-task detection model at prediction time is provided below. As shown in fig. 3, the image is input at the input layer of the multi-task detection model and fed into the position detection network body and the attribute prediction network body respectively. The position detection network body extracts the position features of the targets in the image and passes them to the coordinate branch and the scoring branch. The coordinate branch predicts the coordinates of the target frames in the image from the position features; the scoring branch predicts the score of each target frame from the same features. Meanwhile, the attribute prediction network body extracts the attribute features of the targets in the image and passes them to the attribute branch, which predicts the attributes of the targets from the attribute features. Finally, the indexes of the target frames whose scores exceed a preset score threshold are used to look up the corresponding target attributes, and the positions and attributes of those targets are output in correspondence.
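Putting the earlier sketches together, a single forward pass followed by score filtering and index lookup reproduces this flow; all names come from the illustrative sketches above, not from the patent itself.

```python
import torch

# Assumes MultiTaskDetector, decode_positions, decode_attributes, and
# select_and_pair from the earlier sketches are in scope.
model = MultiTaskDetector(num_attrs=10).eval()
image = torch.rand(1, 3, 256, 256)   # a stand-in image to be detected
with torch.no_grad():
    out = model(image)               # one pass yields positions and all attributes
frames, conf, index = decode_positions(out["coords"][0], out["scores"][0])
attr_table = decode_attributes(out["attrs"][0])
detections = select_and_pair(frames, conf, index, attr_table, score_thresh=0.9)
```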
With further reference to FIG. 4, a flow 400 of one embodiment of a multitask detection model training method according to the present application is illustrated. The multi-task detection model training method comprises the following steps:
step 401, a sample image set is obtained.
In this embodiment, the execution body of the multi-task detection model training method (e.g., the server 103 shown in fig. 1) may obtain a sample image set. The sample images in the set may be images obtained by photographing targets, with the targets in each image labeled. A target in a sample image may be labeled with both its position and its attribute, with only its position, or with only its attribute.
Step 402, regarding a sample image in the sample image set, if a position and an attribute of a target in the sample image are labeled at the same time, taking the sample image as an input, and taking a labeled position and a labeled target attribute of the target in the sample image as an output, and training a multi-task detection model.
In this embodiment, for a sample image in the sample image set, if both the position and the attribute of the target in the sample image are labeled, the execution body may train the multi-task detection model by using the sample image as input and the labeled position and labeled attribute of the target as output.
Generally, the sample image serves simultaneously as the input of the position detection branch network and the attribute prediction branch network, the labeled position of the target serves as the output of the position detection branch network, and the labeled attribute serves as the output of the attribute prediction branch network; the multi-task detection model is then trained with supervision.
In the training stage of the multi-task detection model, if a large number of images in a uniform format suitable for both position detection training and attribute prediction training are collected, the positions and attributes of the targets in the images are usually labeled at the same time to obtain a sample image set. In that case, the execution body may train the position detection branch network and the attribute prediction branch network of the multi-task detection model simultaneously with this sample image set, improving the training efficiency of the multi-task detection model.
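A minimal sketch of one such supervised joint step follows, assuming hypothetical per-branch loss callables; a single optimizer over all parameters updates both branch networks at once.

```python
def joint_training_step(model, optimizer, image, gt_position, gt_attrs,
                        position_loss_fn, attribute_loss_fn):
    out = model(image)
    loss = (position_loss_fn(out["coords"], out["scores"], gt_position)
            + attribute_loss_fn(out["attrs"], gt_attrs))
    optimizer.zero_grad()
    loss.backward()     # gradients reach both branch networks at once
    optimizer.step()
    return loss.item()
```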
Step 403, for a sample image in the sample image set, if the target in the sample image is only labeled with a position, taking the sample image as an input, and taking the labeled position of the target in the sample image as an output, training a position detection branch network in the multi-task detection model.
In this embodiment, for a sample image in the sample image set, if only the position of the target in the sample image is labeled, the execution body may train the position detection branch network in the multi-task detection model by using the sample image as input and the labeled position of the target as output.
In general, the sample image serves as the input of the position detection branch network, the labeled position of the target serves as its output, and the position detection branch network is trained with supervision.
In some embodiments, the position detection branch network may include a position detection network body, a coordinate branch, and a scoring branch. The execution body may train the position detection branch network as follows:
First, the sample image is input to the position detection network body to obtain the position features of the sample image.
Then, the position features of the sample image are input to the coordinate branch and the scoring branch respectively, yielding the predicted coordinates and the predicted score of the target frame in the sample image.
Next, a position loss function is calculated based on the predicted coordinates and predicted score of the target frame and the labeled position of the target.
The position loss function characterizes the difference between the predicted coordinates and score of the target frame and the labeled position of the target.
Finally, the parameters of the position detection branch network are adjusted based on the position loss function.
In each training iteration, it is first determined whether the position loss function calculated in that iteration has been minimized. If it has, the position detection branch network has converged. If it has not, the network has not yet converged; gradient back-propagation is then performed on the position detection branch network based on the position loss function, its parameters are adjusted, and the next training iteration proceeds.
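An illustrative position loss under this structure could combine frame regression with score classification; the specific loss forms (smooth L1 and binary cross-entropy) are common choices assumed here, not forms stated in the application, and the module names come from the earlier sketch. Optimizing only the position branch's parameters is what leaves the attribute prediction branch network untouched.

```python
import torch
import torch.nn.functional as F

def position_loss(pred_coords, pred_scores, gt_coords, gt_objectness):
    coord_loss = F.smooth_l1_loss(pred_coords, gt_coords)            # frame regression
    score_loss = F.binary_cross_entropy(pred_scores, gt_objectness)  # target presence
    return coord_loss + score_loss

# An optimizer restricted to the position branch (body + coordinate + scoring).
model = MultiTaskDetector()
position_params = (list(model.position_body.parameters())
                   + list(model.coordinate_branch.parameters())
                   + list(model.scoring_branch.parameters()))
position_optimizer = torch.optim.SGD(position_params, lr=1e-3)
```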
It should be noted that when only the position detection branch network needs to be trained or upgraded, a large number of images suitable for position detection training are usually collected, and the positions of the targets in these images are labeled to obtain a sample image set. The execution body can then use this sample image set to train the position detection branch network of the multi-task detection model on its own, upgrading the position detection branch network independently without affecting the attribute prediction branch network.
Step 404, regarding the sample images in the sample image set, if the targets in the sample images are only labeled with attributes, taking the sample images as input, taking the labeled attributes of the targets in the sample images as output, and training an attribute prediction branch network in the multi-task detection model.
In this embodiment, for a sample image in the sample image set, if only the attribute of the target in the sample image is labeled, the execution body may train the attribute prediction branch network in the multi-task detection model by using the sample image as input and the labeled attribute of the target as output.
In general, the sample image serves as the input of the attribute prediction branch network, the labeled attribute of the target serves as its output, and the attribute prediction branch network is trained with supervision.
In some embodiments, the attribute prediction branch network may include an attribute prediction network body and an attribute branch. The execution body may train the attribute prediction branch network as follows:
First, the sample image is input to the attribute prediction network body to obtain the attribute features of the sample image.
Then, the attribute features of the sample image are input to the attribute branch, yielding the predicted attribute of the target in the sample image.
Next, an attribute loss function is calculated based on the predicted attribute and the labeled attribute of the target in the sample image.
The attribute loss function characterizes the difference between the predicted attribute and the labeled attribute of the target.
Finally, the parameters of the attribute prediction branch network are adjusted based on the attribute loss function.
In each training iteration, it is first determined whether the attribute loss function calculated in that iteration has been minimized. If it has, the attribute prediction branch network has converged. If it has not, the network has not yet converged; gradient back-propagation is then performed on the attribute prediction branch network based on the attribute loss function, its parameters are adjusted, and the next training iteration proceeds.
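A matching sketch for the attribute branch, mirroring the position case: per-cell cross-entropy is an assumed loss form, and restricting the optimizer to the attribute branch's parameters leaves the position detection branch network untouched (names again from the earlier sketch).

```python
import torch
import torch.nn.functional as F

def attribute_loss(pred_attrs, gt_attrs):
    # pred_attrs: (N, num_attrs, H, W); gt_attrs: (N, H, W) integer class labels
    return F.cross_entropy(pred_attrs, gt_attrs)

# An optimizer restricted to the attribute branch (body + attribute head).
attr_params = (list(model.attribute_body.parameters())
               + list(model.attribute_branch.parameters()))
attribute_optimizer = torch.optim.SGD(attr_params, lr=1e-3)
```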
It should be noted that when only the attribute prediction branch network needs to be trained or upgraded, a large number of images suitable for attribute prediction training are usually collected, and the attributes of the targets in these images are labeled to obtain a sample image set. The execution body can then use this sample image set to train the attribute prediction branch network of the multi-task detection model on its own, upgrading the attribute prediction branch network independently without affecting the position detection branch network.
According to the multi-task detection model training method provided by the embodiments of the present application, for a sample image labeled with both position and attribute, the sample image is used as input and the labeled position and labeled attribute of the target are used as output, and the position detection branch network and the attribute prediction branch network of the multi-task detection model are trained simultaneously. For a sample image labeled only with a position, the sample image is used as input and the labeled position of the target is used as output, and only the position detection branch network is trained. For a sample image labeled only with an attribute, the sample image is used as input and the labeled attribute of the target is used as output, and only the attribute prediction branch network is trained. In this way, a multi-task detection model whose position detection branch network and attribute prediction branch network are separable is constructed. On the one hand, the training data of the two branch networks become separable: when the position detection branch network is trained with sample images labeled only with positions, the attribute prediction branch network is not affected at all; when the attribute prediction branch network is trained with sample images labeled only with attributes, the position detection branch network is not affected at all. Of course, if a sample image is labeled with both position and attribute, the two branch networks can be trained at the same time. This solves the problem that the training data formats of the position detection branch network and the attribute prediction branch network are not uniform. On the other hand, the upgrade iterations of the two branch networks become separable: after the multi-task detection model is trained, the position detection branch network or the attribute prediction branch network can be upgraded independently without affecting the other, avoiding the problem that both networks must be updated simultaneously.
For ease of understanding, a schematic diagram of the network structure of the multi-task detection model at training time is provided below. As shown in fig. 5, a sample image is input at the input layer of the multi-task detection model. If the target in the sample image is labeled with both position and attribute, the image is sent to the position detection network body and the attribute prediction network body respectively; if the target is labeled only with a position, the image is sent only to the position detection network body; if the target is labeled only with an attribute, the image is sent only to the attribute prediction network body. The position detection network body extracts the position features of the target in the sample image and passes them to the coordinate branch and the scoring branch. The coordinate branch outputs the predicted coordinates of the target frame in the sample image, and the scoring branch outputs its predicted score. A position loss function is calculated from the predicted coordinates, the predicted score, and the labeled position of the target, and the parameters of the position detection branch network are adjusted based on it. Similarly, the attribute prediction network body extracts the attribute features of the target in the sample image and passes them to the attribute branch, which outputs the predicted attribute of the target. An attribute loss function is calculated from the predicted attribute and the labeled attribute of the target, and the parameters of the attribute prediction branch network are adjusted based on it. A short dispatch sketch of this label-driven routing follows.
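This sketch builds on the loss functions and optimizers illustrated above; `gt_position` is assumed to be a `(gt_coords, gt_objectness)` pair, and every name is illustrative rather than taken from the application.

```python
def train_on_sample(model, position_optimizer, attribute_optimizer, image,
                    gt_position=None, gt_attrs=None):
    """Assumes at least one of gt_position / gt_attrs is present."""
    out = model(image)
    loss = 0.0
    if gt_position is not None:          # sample labeled with a position
        loss = loss + position_loss(out["coords"], out["scores"], *gt_position)
    if gt_attrs is not None:             # sample labeled with an attribute
        loss = loss + attribute_loss(out["attrs"], gt_attrs)
    position_optimizer.zero_grad()
    attribute_optimizer.zero_grad()
    loss.backward()
    if gt_position is not None:          # only the annotated branches are updated
        position_optimizer.step()
    if gt_attrs is not None:
        attribute_optimizer.step()
```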
With further reference to fig. 6, as an implementation of the methods shown in the above-mentioned figures, the present application provides an embodiment of an image detection apparatus, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable to various electronic devices.
As shown in fig. 6, the image detection apparatus 600 of the present embodiment may include: an image input module 601, a position detection module 602, an attribute prediction module 603, and a correspondence output module 604. The image input module 601 is configured to input an image to be detected to a pre-trained multi-task detection model, wherein the multi-task detection model comprises a position detection branch network and an attribute prediction branch network; a position detection module 602 configured to process the image to be detected by using the position detection branch network, so as to obtain a position of a target in the image to be detected; the attribute prediction module 603 is configured to process the image to be detected by using the attribute prediction branch network to obtain the attribute of the target in the image to be detected; and a corresponding output module 604 configured to correspondingly output the position and the attribute of at least part of the target in the image to be detected.
In the present embodiment, in the image detection apparatus 600, the detailed processing and technical effects of the image input module 601, the position detection module 602, the attribute prediction module 603, and the corresponding output module 604 may be found in the descriptions of steps 201 to 204 in the embodiment corresponding to fig. 2, and are not repeated here.
In some optional implementations of this embodiment, the location includes coordinates, scores, and indices of the target box; and the corresponding output module 604 is further configured to: selecting at least part of targets in the image to be detected based on the scores of the target frames; searching in the attributes of the targets in the image to be detected by using the indexes of the target frames of each selected target to obtain the attributes of each selected target; and correspondingly outputting the position and the attribute of each selected target.
In some optional implementations of this embodiment, the image detection apparatus 600 further includes a model training module (not shown in the figure), and the model training module includes: a sample acquisition sub-module (not shown in the figure) configured to acquire a set of sample images; and a first training sub-module (not shown in the figure) configured to train the multi-task detection model by using the sample image as an input and using the labeling position and the labeling target attribute of the target in the sample image as an output, if the target in the sample image is labeled with the position and the attribute at the same time, for the sample image in the sample image set.
In some optional implementations of this embodiment, the model training module further includes: and a second training submodule (not shown in the figure) configured to train the position detection branch network in the multi-task detection model by taking the sample image as an input and the labeled position of the target in the sample image as an output if the target in the sample image only labels the position.
In some optional implementations of this embodiment, the position detection branch network includes a position detection network body, a coordinate branch, and a scoring branch; and the second training submodule is further configured to: input the sample image to the position detection network body to obtain the position features of the sample image; input the position features of the sample image to the coordinate branch and the scoring branch respectively to obtain the predicted coordinates and predicted score of the target frame in the sample image; calculate a position loss function based on the predicted coordinates and predicted score of the target frame and the labeled position of the target; and adjust the parameters of the position detection branch network based on the position loss function.
In some optional implementations of this embodiment, the model training module further includes: and a third training sub-module (not shown in the figure) configured to train the attribute prediction branch network in the multi-task detection model by taking the sample image as an input and the labeled attribute of the target in the sample image as an output if the target in the sample image is labeled with only the attribute.
In some optional implementations of this embodiment, the attribute prediction branch network includes an attribute prediction network main body and an attribute branch; and the third training submodule is further configured to: inputting the sample image to an attribute prediction network main body to obtain the attribute characteristics of the sample image; inputting the attribute characteristics of the sample image into an attribute branch to obtain the prediction attribute of the target in the sample image; calculating an attribute loss function based on the prediction attribute and the annotation attribute of the target in the sample image; parameters of the attribute prediction branch network are adjusted based on the attribute loss function.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 7 is a block diagram of an electronic device according to an embodiment of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 7, the electronic device includes: one or more processors 701, a memory 702, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Likewise, multiple electronic devices may be connected, with each device providing part of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 7, one processor 701 is taken as an example.
The memory 702 is a non-transitory computer readable storage medium as provided herein. The memory stores instructions executable by at least one processor to cause the at least one processor to perform the image detection method provided herein. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to perform the image detection method provided by the present application.
The memory 702, which is a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as program instructions/modules corresponding to the image detection method in the embodiments of the present application (e.g., the image input module 601, the position detection module 602, the attribute prediction module 603, and the corresponding output module 604 shown in fig. 6). The processor 701 executes various functional applications of the server and data processing, i.e., implements the image detection method in the above-described method embodiments, by running non-transitory software programs, instructions, and modules stored in the memory 702.
The memory 702 may include a program storage area and a data storage area. The program storage area may store an operating system and application programs required for at least one function; the data storage area may store data created through the use of the electronic device of the image detection method, and the like. Further, the memory 702 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 702 may optionally include memory located remotely from the processor 701; such remote memory may be connected over a network to the electronic device of the image detection method. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the image detection method may further include: an input device 703 and an output device 704. The processor 701, the memory 702, the input device 703 and the output device 704 may be connected by a bus or other means, and fig. 7 illustrates an example of a connection by a bus.
The input device 703 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic apparatus of the image detection method, such as an input device of a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, or the like. The output devices 704 may include a display device, auxiliary lighting devices (e.g., LEDs), and tactile feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application specific ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
According to the technical solution of the present application, an image to be detected is first input to a pre-trained multi-task detection model; the image is then processed by the position detection branch network to obtain the positions of the targets in the image while, in parallel, the image is processed by the attribute prediction branch network to obtain the attributes of those targets; finally, the positions and attributes of at least some of the targets are output in correspondence. Because position detection and attribute prediction are performed simultaneously by a multi-task detection model that contains both a position detection branch network and an attribute prediction branch network, and because the attribute prediction branch network predicts the attributes of all targets at once, the time consumed by multi-task detection is greatly reduced. This avoids the prior approach in which a target detection model is first used on its own to detect the target frames in an image and an attribute prediction model is then used on its own to predict the attributes of the targets in those frames, an approach whose attribute prediction time grows linearly with the number of targets.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders; this is not limited herein, as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (16)

1. An image detection method, comprising:
inputting an image to be detected into a pre-trained multi-task detection model, wherein the multi-task detection model comprises a position detection branch network and an attribute prediction branch network;
processing the image to be detected by using the position detection branch network to obtain the position of a target in the image to be detected;
processing the image to be detected by using the attribute prediction branch network to obtain the attribute of the target in the image to be detected;
and correspondingly outputting the position and the attribute of at least part of the target in the image to be detected.
2. The method of claim 1, wherein the position comprises the coordinates, score, and index of the target frame; and
correspondingly outputting the position and the attribute of at least part of targets in the image to be detected, wherein the method comprises the following steps:
selecting at least part of targets in the image to be detected based on the scores of the target frames;
searching in the attributes of the targets in the image to be detected by using the indexes of the target frames of each selected target to obtain the attributes of each selected target;
and correspondingly outputting the position and the attribute of each selected target.
3. The method of claim 1, wherein the training of the multi-tasking detection model comprises:
acquiring a sample image set;
and for the sample images in the sample image set, if the positions and the attributes of the targets in the sample images are labeled simultaneously, taking the sample images as input, and taking the labeled positions and the labeled target attributes of the targets in the sample images as output to train the multi-task detection model.
4. The method of claim 3, wherein the training of the multi-tasking detection model further comprises:
and if the targets in the sample image are labeled only with positions, training the position detection branch network in the multi-task detection model by taking the sample image as input and the labeled positions of the targets in the sample image as output.
5. The method of claim 4, wherein the position detection branch network comprises a position detection network body, a coordinate branch, and a score branch; and
the training the position detection branch network in the multi-task detection model by taking the sample image as input and the labeled positions of the targets in the sample image as output comprises:
inputting the sample image to the position detection network body to obtain the position features of the sample image;
inputting the position features of the sample image into the coordinate branch and the score branch, respectively, to obtain the predicted coordinates and predicted scores of the target boxes in the sample image;
calculating a position loss function based on the predicted coordinates and predicted scores of the target boxes in the sample image and the labeled positions of the targets;
and adjusting the parameters of the position detection branch network based on the position loss function.
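A sketch of the position-only case of claims 4-5 follows, under the same assumptions as the sketch after claim 3; because the attribute output does not enter the loss, backpropagation leaves the attribute branch untouched.

# Illustrative sketch of claims 4-5; loss choices are assumptions.
import torch.nn.functional as F

def position_loss(model, image, gt_coords, gt_scores):
    pred_coords, pred_scores, _ = model(image)  # attribute output unused
    coord_loss = F.smooth_l1_loss(pred_coords, gt_coords)                    # coordinate branch
    score_loss = F.binary_cross_entropy_with_logits(pred_scores, gt_scores)  # score branch
    return coord_loss + score_loss  # the "position loss function"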
6. The method of claim 3, wherein the training of the multi-tasking detection model further comprises:
and if the targets in the sample image are labeled only with attributes, training the attribute prediction branch network in the multi-task detection model by taking the sample image as input and the labeled attributes of the targets in the sample image as output.
7. The method of claim 6, wherein the attribute prediction branch network comprises an attribute prediction network body and an attribute branch; and
the training the attribute prediction branch network in the multi-task detection model by taking the sample image as input and the labeled attributes of the targets in the sample image as output comprises:
inputting the sample image to the attribute prediction network body to obtain the attribute features of the sample image;
inputting the attribute features of the sample image into the attribute branch to obtain the predicted attributes of the targets in the sample image;
calculating an attribute loss function based on the predicted attributes and the labeled attributes of the targets in the sample image;
and adjusting the parameters of the attribute prediction branch network based on the attribute loss function.
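Correspondingly, a sketch of the attribute-only case of claims 6-7; cross-entropy is again an assumed stand-in for the unspecified attribute loss.

# Illustrative sketch of claims 6-7; the loss choice is an assumption.
import torch.nn.functional as F

def attribute_loss(model, image, gt_attrs):
    _, _, pred_attrs = model(image)  # position outputs unused
    return F.cross_entropy(pred_attrs, gt_attrs)

Backpropagating only this loss updates the attribute prediction branch (and any shared layers), while the coordinate and score branches receive no gradient.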
8. An image detection apparatus comprising:
an image input module configured to input an image to be detected to a pre-trained multi-task detection model, wherein the multi-task detection model comprises a position detection branch network and an attribute prediction branch network;
a position detection module configured to process the image to be detected by using the position detection branch network to obtain the positions of targets in the image to be detected;
an attribute prediction module configured to process the image to be detected by using the attribute prediction branch network to obtain the attributes of the targets in the image to be detected;
and a corresponding output module configured to correspondingly output the position and the attribute of at least part of the targets in the image to be detected.
9. The apparatus of claim 8, wherein the position comprises the coordinates, score, and index of the target box; and the corresponding output module is further configured to:
select at least part of the targets in the image to be detected based on the scores of their target boxes;
search the attributes of the targets in the image to be detected using the target box index of each selected target, to obtain the attribute of each selected target;
and correspondingly output the position and the attribute of each selected target.
10. The apparatus of claim 8, wherein the apparatus further comprises a model training module comprising:
a sample acquisition sub-module configured to acquire a set of sample images;
and a first training submodule configured to, for a sample image in the sample image set, if the targets in the sample image are labeled with both positions and attributes, train the multi-task detection model by taking the sample image as input and taking the labeled positions and labeled attributes of the targets in the sample image as output.
11. The apparatus of claim 10, wherein the model training module further comprises:
and a second training submodule configured to, if the targets in the sample image are labeled only with positions, train the position detection branch network in the multi-task detection model by taking the sample image as input and the labeled positions of the targets in the sample image as output.
12. The apparatus of claim 11, wherein the position detection branch network comprises a position detection network body, a coordinate branch, and a score branch; and
the second training submodule is further configured to:
input the sample image to the position detection network body to obtain the position features of the sample image;
input the position features of the sample image into the coordinate branch and the score branch, respectively, to obtain the predicted coordinates and predicted scores of the target boxes in the sample image;
calculate a position loss function based on the predicted coordinates and predicted scores of the target boxes in the sample image and the labeled positions of the targets;
and adjust the parameters of the position detection branch network based on the position loss function.
13. The apparatus of claim 10, wherein the model training module further comprises:
and a third training submodule configured to, if the targets in the sample image are labeled only with attributes, train the attribute prediction branch network in the multi-task detection model by taking the sample image as input and the labeled attributes of the targets in the sample image as output.
14. The apparatus of claim 13, wherein the attribute prediction branch network comprises an attribute prediction network body and an attribute branch; and
the third training submodule is further configured to:
input the sample image to the attribute prediction network body to obtain the attribute features of the sample image;
input the attribute features of the sample image into the attribute branch to obtain the predicted attributes of the targets in the sample image;
calculate an attribute loss function based on the predicted attributes and the labeled attributes of the targets in the sample image;
and adjust the parameters of the attribute prediction branch network based on the attribute loss function.
15. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
16. A computer-readable medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method of any one of claims 1-7.
CN202010478424.4A 2020-05-29 Image detection method, device, equipment and storage medium Active CN111640103B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010478424.4A CN111640103B (en) 2020-05-29 Image detection method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111640103A (en) 2020-09-08
CN111640103B (en) 2024-07-02

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103970775A (en) * 2013-01-31 2014-08-06 山东财经大学 Object spatial position relationship-based medical image retrieval method
CN106599869A (en) * 2016-12-22 2017-04-26 安徽大学 Vehicle attribute identification method based on multi-task convolutional neural network
DE102018114799A1 (en) * 2017-06-20 2018-12-20 Nvidia Corporation SEMINAR-LEANED LEARNING FOR ORIENTATION LOCALIZATION
US20190205643A1 (en) * 2017-12-29 2019-07-04 RetailNext, Inc. Simultaneous Object Localization And Attribute Classification Using Multitask Deep Neural Networks
US20190213747A1 (en) * 2018-01-10 2019-07-11 Samsung Electronics Co., Ltd. Image processing method and apparatus
CN108875932A (en) * 2018-02-27 2018-11-23 北京旷视科技有限公司 Image-recognizing method, device and system and storage medium
CN108875805A (en) * 2018-05-31 2018-11-23 北京迈格斯智能科技有限公司 The method for improving detection accuracy using detection identification integration based on deep learning
CN109741318A (en) * 2018-12-30 2019-05-10 北京工业大学 The real-time detection method of single phase multiple dimensioned specific objective based on effective receptive field
CN110096964A (en) * 2019-04-08 2019-08-06 厦门美图之家科技有限公司 A method of generating image recognition model
CN110378420A (en) * 2019-07-19 2019-10-25 Oppo广东移动通信有限公司 A kind of image detecting method, device and computer readable storage medium
CN110532409A (en) * 2019-07-30 2019-12-03 西北工业大学 Image search method based on isomery bilinearity attention network
CN110751174A (en) * 2019-09-10 2020-02-04 华中科技大学 Dial plate detection method and system based on multitask cascade convolution network
CN110956642A (en) * 2019-12-03 2020-04-03 深圳市未来感知科技有限公司 Multi-target tracking identification method, terminal and readable storage medium
CN111126299A (en) * 2019-12-25 2020-05-08 中国科学院自动化研究所 Optimization method and evaluation system for time sequence target detection

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
林城龙: "基于多任务卷积神经网络的服装图像分类与检索", pages 138 - 4632 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113591568A (en) * 2021-06-28 2021-11-02 北京百度网讯科技有限公司 Target detection method, training method of target detection model and device thereof
CN113591567A (en) * 2021-06-28 2021-11-02 北京百度网讯科技有限公司 Target detection method, training method of target detection model and device thereof
CN114187488A (en) * 2021-12-10 2022-03-15 北京百度网讯科技有限公司 Image processing method, apparatus, device, medium, and program product
CN114187488B (en) * 2021-12-10 2023-11-17 北京百度网讯科技有限公司 Image processing method, device, equipment and medium

Similar Documents

Publication Publication Date Title
EP3889830A1 (en) Cross-modality processing method and apparatus, electronic device and computer storage medium
CN111428008B (en) Method, apparatus, device and storage medium for training a model
CN111639710B (en) Image recognition model training method, device, equipment and storage medium
CN111931591B (en) Method, device, electronic equipment and readable storage medium for constructing key point learning model
CN111582453B (en) Method and device for generating neural network model
CN111507104B (en) Method and device for establishing label labeling model, electronic equipment and readable storage medium
CN111539514A (en) Method and apparatus for generating structure of neural network
CN111539479B (en) Method and device for generating sample data
CN112036509A (en) Method and apparatus for training image recognition models
CN111523596A (en) Target recognition model training method, device, equipment and storage medium
CN110795569B (en) Method, device and equipment for generating vector representation of knowledge graph
CN111708922A (en) Model generation method and device for representing heterogeneous graph nodes
CN111582477B (en) Training method and device for neural network model
CN112270711B (en) Model training and posture prediction method, device, equipment and storage medium
CN112149741B (en) Training method and device for image recognition model, electronic equipment and storage medium
JP7194215B2 (en) KEYPOINT IDENTIFICATION METHOD AND DEVICE, DEVICE, STORAGE MEDIUM
CN112102448A (en) Virtual object image display method and device, electronic equipment and storage medium
CN111241838B (en) Semantic relation processing method, device and equipment for text entity
CN110852379B (en) Training sample generation method and device for target object recognition
CN111783952A (en) Configuration method, device, system, electronic equipment and storage medium
CN111079945A (en) End-to-end model training method and device
CN110543558A (en) question matching method, device, equipment and medium
CN111090991A (en) Scene error correction method and device, electronic equipment and storage medium
CN110532415B (en) Image search processing method, device, equipment and storage medium
CN111666771A (en) Semantic label extraction device, electronic equipment and readable storage medium of document

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20240602

Address after: Room 09, 8th Floor, Building 3, Zone 4, No. 186 South Fourth Ring West Road, Fengtai District, Beijing, 100160

Applicant after: Beijing Quanwang Zhishu Technology Co.,Ltd.

Country or region after: China

Address before: 2/F, Baidu Building, 10 Shangdi 10th Street, Haidian District, Beijing 100085

Applicant before: BEIJING BAIDU NETCOM SCIENCE AND TECHNOLOGY Co.,Ltd.

Country or region before: China

GR01 Patent grant