CN113627298A - Training method of target detection model and method and device for detecting target object - Google Patents


Info

Publication number
CN113627298A
CN113627298A
Authority
CN
China
Prior art keywords
target object
probability
target
target detection
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110878323.0A
Other languages
Chinese (zh)
Inventor
张健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110878323.0A priority Critical patent/CN113627298A/en
Publication of CN113627298A publication Critical patent/CN113627298A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/047 Probabilistic or stochastic networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides a training method for a target detection model, and a method, apparatus, device, and medium for detecting a target object, relating to the field of artificial intelligence, in particular to computer vision and deep learning technology, and applicable to smart home and smart city scenarios. The training method includes: inputting a sample image into a feature extraction network to obtain first feature data, where the sample image has a label indicating a first actual position of a first target object in the sample image and an actual conditional probability of the first target object for a second target object; inputting the first feature data into a target detection network to obtain a first predicted position of the first target object and a first occurrence probability of the first target object for the first predicted position; inputting the first feature data into a conditional random field network to obtain a first conditional probability of the first target object for the second target object; and training the target detection model based on the first actual position, the first predicted position, the first occurrence probability, the actual conditional probability, and the first conditional probability.

Description

Training method of target detection model and method and device for detecting target object
Technical Field
The present disclosure relates to the technical field of artificial intelligence, in particular to the technical fields of computer vision and deep learning, and can be applied to smart home and smart city scenarios.
Background
With the development of artificial intelligence, deep learning represented by a convolutional neural network is widely applied to the fields of target detection, image classification and the like. For example, in a gesture recognition scenario, a human hand may be used as a target to be detected to establish a target detection task.
In scenes with many people, a gesture command is often interfered with by the hands of people other than the command issuer, making it difficult to guarantee the accuracy of gesture recognition. Similar to the gesture recognition scene, any scene of detecting a sub-object that can move relative to the center point of a containing target object suffers from inaccurate recognition.
Disclosure of Invention
The present disclosure provides a training method of a target detection model for improving detection accuracy, and a method, an apparatus, a device, and a storage medium for detecting a target object.
According to one aspect of the present disclosure, a training method of a target detection model is provided, wherein the target detection model includes a feature extraction network, a target detection network, and a conditional random field network. The method includes: inputting a sample image into the feature extraction network to obtain first feature data, wherein the sample image has a label indicating a first actual position of a first target object in the sample image and an actual conditional probability of the first target object for a second target object; inputting the first feature data into the target detection network to obtain a first predicted position of the first target object and a first occurrence probability of the first target object for the first predicted position; inputting the first feature data into the conditional random field network to obtain a first conditional probability of the first target object for the second target object; and training the target detection model based on the first actual position, the first predicted position, the first occurrence probability, the actual conditional probability, and the first conditional probability. The second target object includes the first target object, and the first target object can move relative to the center of the second target object.
According to another aspect of the present disclosure, there is provided a method of detecting a target object using a target detection model, the target detection model including a feature extraction network, a target detection network, and a conditional random field network. The method includes: inputting an image to be detected into the feature extraction network to obtain second feature data; inputting the second feature data into the target detection network to obtain a third predicted position of the first target object and a third occurrence probability of the first target object for the third predicted position; inputting the second feature data into the conditional random field network to obtain a second conditional probability of the first target object for the second target object; and determining the first target object included in the image to be detected based on the third occurrence probability and the second conditional probability. The second target object includes the first target object, and the first target object can move relative to the center of the second target object. The target detection model is trained using the above training method of the target detection model.
According to another aspect of the present disclosure, a training apparatus for a target detection model is provided, wherein the target detection model includes a feature extraction network, a target detection network, and a conditional random field network. The apparatus includes: a first feature data obtaining module, configured to input a sample image into the feature extraction network to obtain first feature data, wherein the sample image has a label indicating a first actual position of a first target object in the sample image and an actual conditional probability of the first target object for a second target object; a first target detection module, configured to input the first feature data into the target detection network to obtain a first predicted position of the first target object and a first occurrence probability of the first target object for the first predicted position; a first probability obtaining module, configured to input the first feature data into the conditional random field network to obtain a first conditional probability of the first target object for the second target object; and a model training module, configured to train the target detection model based on the first actual position, the first predicted position, the first occurrence probability, the actual conditional probability, and the first conditional probability, wherein the second target object includes the first target object, and the first target object can move relative to the center of the second target object.
According to another aspect of the present disclosure, there is provided an apparatus for detecting a target object using a target detection model, wherein the target detection model includes a feature extraction network, a target detection network, and a conditional random field network. The apparatus includes: a second feature data obtaining module, configured to input an image to be detected into the feature extraction network to obtain second feature data; a second target detection module, configured to input the second feature data into the target detection network to obtain a third predicted position of the first target object and a third occurrence probability of the first target object for the third predicted position; a second probability obtaining module, configured to input the second feature data into the conditional random field network to obtain a second conditional probability of the first target object for the second target object; and an object determination module, configured to determine the first target object included in the image to be detected based on the third occurrence probability and the second conditional probability, wherein the second target object includes the first target object, and the first target object can move relative to the center of the second target object. The target detection model is trained using the above training apparatus of the target detection model.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method of training a target detection model and/or a method of detecting a target object using a target detection model provided by the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a training method of a target detection model and/or a method of detecting a target object using the target detection model provided by the present disclosure.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the method of training an object detection model and/or the method of detecting an object using an object detection model provided by the present disclosure.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram of an application scenario of a training method of a target detection model and a method and an apparatus for detecting a target object according to an embodiment of the present disclosure;
FIG. 2 is a flow diagram of a method of training a target detection model according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a structure of a target detection model according to an embodiment of the present disclosure;
FIG. 4 is a schematic flow chart diagram of a method for detecting a target object using a target detection model according to an embodiment of the present disclosure;
FIG. 5 is a block diagram of a training apparatus for an object detection model according to an embodiment of the present disclosure;
FIG. 6 is a block diagram of an apparatus for detecting a target object using a target detection model according to an embodiment of the present disclosure; and
fig. 7 is a block diagram of an electronic device for implementing a training method of a target detection model and/or a method for detecting a target object using a target detection model according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The present disclosure provides a training method of a target detection model, wherein the target detection model comprises a feature extraction network, a target detection network and a conditional random field network. The training method comprises a characteristic extraction stage, a target detection stage, a probability prediction stage and a model training stage. In the feature extraction stage, a sample image is input into a feature extraction network, and first feature data is obtained, wherein the sample image has a label indicating a first actual position of a first target object and an actual conditional probability of the first target object for a second target object in the sample image. In the target detection stage, the first characteristic data is input into a target detection network, and a first predicted position of the first target object and a first occurrence probability of the first target object for the first predicted position are obtained. In a probability prediction stage, first characteristic data are input into a conditional random field network, and a first conditional probability of a first target object for a second target object is obtained. In the model training phase, a target detection model is trained based on a first actual position, a first predicted position, a first occurrence probability, an actual conditional probability and a first conditional probability. The second target object comprises a first target object, and the first target object can move relative to the center of the second target object.
An application scenario of the method and apparatus provided by the present disclosure will be described below with reference to fig. 1.
Fig. 1 is a schematic application scenario diagram of a training method of a target detection model, a method for detecting a target object, and an apparatus according to an embodiment of the present disclosure.
As shown in fig. 1, the scenario 100 of this embodiment includes a plurality of users 110, smart home devices 120, and a server 130. The smart home devices 120 may interact with the server 130 through the network to obtain data from the server 130 or send requests to the server 130.
The smart home device 120 may be, for example, a smart speaker, a smart refrigerator, a smart television, or other smart devices. The smart home device 120 may be provided with an image acquisition device for acquiring images of a human face, a human hand, and the like, so as to realize intelligent interaction with a user. In an embodiment, any electronic device having an image capturing device, such as a smart phone or a notebook computer, may be used to replace the smart home device 120.
When the smart home device 120 has a gesture recognition function, for example, the smart home device 120 may send the image 140 captured by the image capturing device to the server 130 in communication connection therewith, recognize the captured image 140 by the server 130, obtain a hand gesture in the captured image 140, and determine a gesture instruction based on the hand gesture. The server 130 may then determine response information in response to the gesture command and feed the response information back to the smart home device 120 as the recognition result 150. Alternatively, the server 130 may feed back the gesture instruction as the recognition result 150 to the smart home device 120. Or, the server 130 may feed back only the recognized hand position as the recognition result 150 to the smart home device 120, and the smart home device 120 determines the hand posture and the gesture instruction according to the hand position.
According to the embodiment of the present disclosure, when the captured image 140 is recognized, a human hand may be taken as the target object, and the human hand in the captured image 140 may be detected using a target detection model. The target detection model may include, for example, the You Only Look One-level Feature model (YOLOF), the Training-Time-Friendly Network model (TTFNet), and other models capable of detecting a human hand. The present disclosure is not limited thereto.
The target detection model may be trained by the server 130 in advance, or may be trained by another server communicatively connected to the server 130, and the server 130 performs human hand detection using the target detection model.
According to the embodiment of the disclosure, as shown in FIG. 1, when there are multiple users within the collection range of the image collection device of the smart home device 120, the collected image may include the hands of multiple users. When one of the users issues an instruction through gestures, a coherent sequence of gesture actions is usually required. During the issuing of the command, if another user's hand happens to form a predetermined gesture that can be responded to, the command issuer's gesture cannot be accurately recognized, and false responses may occur. The present disclosure therefore seeks to improve this situation by providing a target detection model that improves the accuracy of gesture recognition.
It should be noted that the training method of the object detection model provided in the present disclosure may be generally performed by the server 130, or another server communicatively connected to the server 130. Accordingly, the training apparatus of the object detection model provided by the present disclosure may be disposed in the server 130 or other servers communicatively connected to the server 130. The method for detecting a target object using a target detection model provided by the present disclosure may be performed by the server 130. Accordingly, the apparatus for detecting a target object using a target detection model provided by the present disclosure may be disposed in the server 130.
It should be understood that the number and types of users, smart home devices, and servers in FIG. 1 are merely illustrative. There may be any number and type of users, smart home devices, and servers, as desired for the implementation.
The training method of the target detection model provided by the present disclosure will be described in detail below with reference to FIGS. 2 to 3, in conjunction with the scenario of FIG. 1.
As shown in fig. 2, the training method 200 of the target detection model of this embodiment may include operations S210 to S240. The target detection model includes a feature extraction network, a target detection network, and a conditional random field network. It will be appreciated that the trained target detection model may be used to identify a smaller object that is included in a larger object and can move relative to the center point of the larger object. For example, the trained target detection model may be used to identify a human hand, identify a camera on the gimbal of an unmanned aerial vehicle, identify an arm of an unmanned aerial vehicle, identify the dump body of a dump truck, and the like, which is not limited by the present disclosure.
In operation S210, a sample image is input to a feature extraction network, resulting in first feature data.
According to the embodiment of the present disclosure, the feature extraction network may adopt, for example, a Residual Neural Network (ResNet) or Darknet framework, or the like. The ResNet may include ResNet50, ResNet101, and the like, which is not limited in this disclosure.
In this embodiment, the sample image may be input into the feature extraction network, which processes it and outputs the first feature data.
For example, the sample image may be an image to which a label is added in advance, and the image should include the first target object to be detected. The tag may indicate an actual location of the first target object. The actual position may be represented, for example, by the position of a two-dimensional bounding box, which may be added to the image to obtain a label for the image. The actual position may be represented by the position of the center point of the two-dimensional bounding box in a coordinate system established based on the sample image, and the height and width of the two-dimensional bounding box.
In an embodiment, the label of the sample image may further indicate an actual conditional probability between a second target object including the first target object and the first target object to represent a probability of the first target object occurring in case the second target object occurs. By means of the label indicating the actual conditional probability, an indication of the binding relationship between the first target object and the second target object may be achieved. For example, if the bounding box region of the second target object in the sample image includes the pixel position of the first target object, the actual conditional probability at that pixel position is 1, and the actual conditional probabilities at other pixel positions are 0. In the gesture recognition scenario, in the bounding box region representing the user a in the image, the actual conditional probability is 1 at the position of the human hand of the user a, and the conditional probabilities are 0 at other positions.
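As a concrete illustration of such a label, the conditional-probability map for one first/second target object pair might be built as follows. This is a minimal NumPy sketch; the function name, the (x1, y1, x2, y2) box format, and the single-pixel center encoding are illustrative assumptions, not taken from the patent text.

```python
import numpy as np

def conditional_probability_label(image_hw, person_box, hand_center):
    """Build the actual conditional-probability map described above.

    Hypothetical formats (not specified in the patent text):
      image_hw    -- (height, width) of the sample image
      person_box  -- (x1, y1, x2, y2) bounding box of the second target
                     object (e.g. the person)
      hand_center -- (x, y) pixel position of the first target object
                     (e.g. the hand)
    The map is 1 at the hand's pixel position when it lies inside the
    person's bounding box, and 0 everywhere else.
    """
    h, w = image_hw
    label = np.zeros((h, w), dtype=np.float32)
    x1, y1, x2, y2 = person_box
    x, y = hand_center
    if x1 <= x <= x2 and y1 <= y <= y2:
        label[y, x] = 1.0
    return label
```

For instance, with an 8×8 image, a person box (1, 1, 6, 6), and a hand at (3, 4), the map is 1 only at pixel (row 4, column 3); a hand outside the box yields an all-zero map.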
In operation S220, the first feature data is input into the target detection network, and a first predicted position of the first target object and a first occurrence probability of the first target object with respect to the first predicted position are obtained.
According to an embodiment of the present disclosure, the target detection network may be the encoder-decoder network of the target detection model, and may include two branches for respectively predicting the first predicted position of the first target object and the probability of the first target object appearing at that position.
In this embodiment, the first feature data may be input into the encoding network of the target detection network to obtain encoded features. The encoded features are then input into the decoding networks of the two branches, which respectively output the predicted position and the occurrence probability. The predicted position may be represented by, for example, the center position of the bounding box of the first target object together with the height and width of that bounding box. The occurrence probability may represent the probability that the object at the predicted position is the first target object.
In operation S230, the first feature data is input into the conditional random field network, and a first conditional probability of the first target object with respect to the second target object is obtained.
According to an embodiment of the present disclosure, the first feature data may be input into the conditional random field network, which outputs the first conditional probability. The conditional random field network may be a Conditional Random Field (CRF) model in the related art, which as a whole is a probabilistic graphical model (PGM). The conditional random field network may be a branch structure in parallel with the two branches of the target detection network. Alternatively, the conditional random field network may share an encoding network with the target detection network, with the decoding network of the conditional random field network and the decoding networks of the target detection network parallel to one another.
According to an embodiment of the present disclosure, the conditional random field network may be formed by sequentially connecting a plurality of convolutional layers, for example, and parameters such as the number of layers of the plurality of convolutional layers and the size of a convolutional kernel in each convolutional layer may be set according to actual requirements. For example, 5 convolutional layers may be provided, the size of the convolutional kernel in each convolutional layer is 3 × 3, and the number of the convolutional kernels in the convolutional layers may be set according to the size of the input feature data, which is not limited in this disclosure.
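A minimal sketch of such a stack, written in pure NumPy for illustration. The channel counts, the ReLU activations between layers, and the final sigmoid are assumptions — the text above only fixes five convolutional layers with 3×3 kernels.

```python
import numpy as np

def conv2d_same(x, w, b):
    """Naive 3x3 convolution with 'same' zero padding.
    x: (C_in, H, W), w: (C_out, C_in, 3, 3), b: (C_out,)."""
    c_out = w.shape[0]
    c_in, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((c_out, h, wd), dtype=np.float64)
    for o in range(c_out):
        for i in range(c_in):
            for dy in range(3):
                for dx in range(3):
                    out[o] += w[o, i, dy, dx] * xp[i, dy:dy + h, dx:dx + wd]
        out[o] += b[o]
    return out

def crf_decoder(features, weights, biases):
    """Five 3x3 conv layers in sequence; ReLU between layers and a
    sigmoid on the last layer (both assumed activations) so the output
    can be read as a per-pixel conditional probability."""
    x = features
    for k, (w, b) in enumerate(zip(weights, biases)):
        x = conv2d_same(x, w, b)
        if k < len(weights) - 1:
            x = np.maximum(x, 0.0)            # ReLU (assumed)
    return 1.0 / (1.0 + np.exp(-x))           # sigmoid (assumed)
```

With, say, channel counts [4, 8, 8, 8, 8, 1], a (4, H, W) feature map yields a (1, H, W) map of values in [0, 1].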
According to an embodiment of the present disclosure, the first conditional probability may represent, for example, a probability value of occurrence of the first target object included by the second target object at a position of the second target object in the image. In particular, it may represent a probability value of the occurrence of the first target object at the center position of the second target object.
In operation S240, a target detection model is trained based on the first actual position, the first predicted position, the first occurrence probability, the actual conditional probability, and the first conditional probability.
According to embodiments of the present disclosure, a penalty function may be assigned to the task of each branch in the target detection model. According to the embodiment, the value of the loss function distributed to each branch can be obtained according to the prediction result output by each branch and the actual result indicated by the label. And then, optimizing parameters of each network in the target detection model based on the value of the loss function by adopting a gradient descent algorithm, a back propagation algorithm and the like, so as to realize the training of the target detection model. It is to be understood that, according to actual needs, the loss function may be allocated only to the tasks of a part of branches in the target detection model, and the loss function may not be allocated to the tasks of other branches except for the part of branches, which is not limited by the present disclosure.
According to an embodiment of the present disclosure, for branches of a predicted location, the assigned penalty function may be a positioning penalty function. For branches that predict probability, the assigned penalty function may be a regression penalty function. The same or different regression loss functions may be used for the branch that obtains the first conditional probability and the branch that obtains the first occurrence probability, which is not limited in this disclosure.
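A hedged sketch of how the per-branch losses described above might be combined. The concrete choices — L1 for the positioning loss, binary cross-entropy for both probability branches, equal weights — are illustrative assumptions; the text only names the kind of loss assigned to each branch.

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    """Binary cross-entropy, averaged over all positions."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(pred)
                          + (1.0 - target) * np.log(1.0 - pred)))

def total_loss(pred_box, true_box, occ_prob, occ_target,
               cond_prob, cond_target, w=(1.0, 1.0, 1.0)):
    """Combine the three branch losses named in the text.
    Loss choices (L1 / BCE) and weights w are illustrative assumptions."""
    loc = float(np.mean(np.abs(np.asarray(pred_box, dtype=np.float64)
                               - np.asarray(true_box, dtype=np.float64))))
    occ = bce(occ_prob, occ_target)    # occurrence-probability branch
    cond = bce(cond_prob, cond_target)  # conditional-probability branch
    return w[0] * loc + w[1] * occ + w[2] * cond
```

A perfect prediction drives all three terms toward zero; any mismatch in position or either probability map raises the total, which is then minimized by gradient descent with back-propagation as described above.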
In the embodiments of the present disclosure, by arranging a conditional random field network in the target detection model and training the whole model based on the conditional probability output by that network, the trained target detection model can learn the association between the first target object and the second target object. Therefore, when the trained target detection model is used to detect target objects in a scene with multiple second target objects, the first target objects included by different second target objects can be effectively distinguished. For example, in a gesture recognition scene, the conditional probability obtained by the target detection model can be used to eliminate, from the detected first target objects, those not bound to the second target object of interest. This improves the accuracy of gesture recognition and thus the human-computer interaction experience.
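One way such elimination could look at inference time, as a sketch: keep a detected first target object only when its occurrence probability, weighted by the conditional probability at its predicted center, is high enough. The multiplicative scoring rule, the threshold `tau`, and the data formats are assumptions for illustration.

```python
import numpy as np

def filter_detections(detections, cond_prob_map, tau=0.5):
    """Keep detections whose joint score (occurrence probability times
    the conditional probability at the predicted center) exceeds tau.

    detections:    list of ((cx, cy), occurrence_probability) tuples
    cond_prob_map: (H, W) conditional-probability map from the CRF branch
    """
    kept = []
    for (cx, cy), p_occ in detections:
        score = p_occ * cond_prob_map[cy, cx]
        if score > tau:
            kept.append(((cx, cy), score))
    return kept
```

A hand detected at a position where the conditional-probability map is near zero (i.e. not bound to the second target object) is discarded even if its raw occurrence probability is high.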
Fig. 3 is a schematic structural diagram of an object detection model according to an embodiment of the present disclosure.
According to an embodiment of the present disclosure, a heat map for a first target object may be output via a target detection model by inputting first feature data into the target detection model. Wherein the target detection network may generate a heat map over the first feature data using a gaussian kernel. Each point in the generated heat map indicates a probability that the center of the first target object is located at the respective point. The method of generating the heatmap using the Gaussian kernel may ensure that the target detection network produces a more robust activation value near the center of the first target object. Thus, the embodiment may determine a predicted center position of the first target object based on the peak point in the heat map, by which the first predicted position may be represented. At the same time, the probability indicated by the peak point may be determined as a first probability of occurrence of the first target object for the first predicted position.
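A sketch of the heatmap construction and peak decoding described above (the Gaussian width `sigma` and the single-object case are simplifying assumptions):

```python
import numpy as np

def gaussian_heatmap(shape_hw, center_xy, sigma=2.0):
    """Heatmap whose value at each point is a Gaussian of the distance
    to the object's center, peaking at 1 on the center itself."""
    h, w = shape_hw
    ys, xs = np.mgrid[0:h, 0:w]
    cx, cy = center_xy
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def decode_peak(heatmap):
    """Return the peak position (read as the predicted center) and the
    peak value (read as the first occurrence probability)."""
    idx = int(np.argmax(heatmap))
    cy, cx = divmod(idx, heatmap.shape[1])
    return (cx, cy), float(heatmap[cy, cx])
```

For a heatmap built around center (5, 7), `decode_peak` recovers exactly that point with a peak value of 1, matching the claim that the heatmap's peak determines both the predicted center and the occurrence probability.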
According to an embodiment of the present disclosure, the first predicted position should include size information of the first target object in addition to the predicted center position. Thus, the accuracy of the obtained first predicted position is improved, and the accuracy of positioning the first target object is improved.
In one embodiment, as shown in fig. 3, the object detection model 300 of this embodiment includes a feature extraction network 310, a center location unit composed of an encoding network 321 and a decoding network 322, a size regression unit composed of an encoding network 321 and a decoding network 323, and a conditional random field network composed of an encoding network 321 and a decoding network 332. The central positioning unit and the size regression unit form a target detection network.
In one embodiment, the target detection model 300 may be constructed based on a TTFNET network model, and the target detection model 300 is different from the TTFNET network model in the related art in that a decoding network 332 is provided in the target detection model 300.
As such, when obtaining the first predicted position of the first target object and the first occurrence probability of the first target object for the first predicted position, the embodiment may input the sample image into the target detection model 300. After the sample image is processed by the feature extraction network 310, the first feature data is obtained. The first feature data is then input into the encoding network 321 shared by the target detection network and the conditional random field network to obtain the encoded features. The encoded features are input into the decoding network 322 in the central positioning unit, so that the heat map 301 for the first target object may be obtained. The encoded features are input into the decoding network 323 in the size regression unit, so that the regression result 302 of the width and the height of the bounding box of the first target object may be obtained, and the height and the width obtained by the regression may be used as the predicted height and the predicted width of the first target object. By inputting the encoded features into the decoding network 332 in the conditional random field network, a conditional probability graph 303 may be obtained, which may indicate the probability of the first target object occurring at the pixel positions where the second target object occurs.
Illustratively, the decoding network 323 may derive the distances from the center of the located first target object to the four edges of the bounding box by regressing the encoded features. The width and height of the bounding box of the first target object are thus obtained, which height, width and central position together constitute a first predicted position.
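The three decoding branches operating on one set of shared encoded features may be sketched as follows (a PyTorch sketch; the channel counts, layer depths, and class count are assumptions, since the disclosure does not specify the decoders at this level of detail):

```python
import torch
from torch import nn

class DetectionHeads(nn.Module):
    """Sketch of the three decoding branches sharing one encoded feature:
    a heat-map head (central positioning unit), a size-regression head
    predicting distances to the four box edges, and a conditional-
    probability head (conditional random field branch)."""
    def __init__(self, in_ch=64, num_classes=2):
        super().__init__()
        def head(out_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(),
                nn.Conv2d(in_ch, out_ch, 1))
        self.heatmap = head(num_classes)  # cf. decoding network 322
        self.size = head(4)               # cf. decoding network 323: l, t, r, b
        self.cond_prob = head(1)          # cf. decoding network 332

    def forward(self, feat):
        return (torch.sigmoid(self.heatmap(feat)),
                self.size(feat),
                torch.sigmoid(self.cond_prob(feat)))

feat = torch.randn(1, 64, 32, 32)  # encoded features from the shared encoder
hm, wh, cp = DetectionHeads()(feat)
```

The sigmoid on the heat-map and conditional-probability heads keeps their outputs in [0, 1] so they can be read as probabilities; the size head regresses unbounded distances.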
Illustratively, in the gesture recognition scenario, the conditional probability graph 303 may include nodes representing a center of a human body, a center of a left hand, and a center of a right hand, an edge pointing from the center of the human body to the center of the left hand, and an edge pointing from the center of the human body to the center of the right hand, which are used to represent a probability dependency, for example. After the conditional probability graph 303 is obtained, the conditional probability of the human hand for the human body can be calculated based on the probability dependency between the human body center and the human hand center.
It is understood that the heat map 301 and the regression result 302 are obtained in a manner similar to that of the TTFNET model in the related art, and the method for obtaining the conditional probability based on the conditional probability graph 303 is similar to the method for obtaining a conditional probability by adopting a conditional random field model in the related art; these will not be described in detail herein.
In an embodiment, the target detection model may obtain a predicted position and an occurrence probability of the second target object in addition to the predicted position and the occurrence probability of the first target object. Specifically, the image feature data may be input into the target detection network to obtain a second predicted position of the second target object and a second occurrence probability of the second target object for the second predicted position. The method of obtaining the predicted position and the occurrence probability of the second target object is similar to the aforementioned method of obtaining the predicted position and the occurrence probability of the first target object. That is, in the target detection model 300 shown in fig. 3, the target detection network can detect not only the first target object but also the second target object. For example, the target detection network may be provided with two detection channels for detecting the first target object and the second target object, respectively. In this case, the heat map output by the target detection network includes the heat maps of the two detection channels.
Exemplarily, the label of the sample image may also indicate a second actual position of the second target object. In this way, the target detection model may be trained based on the predicted position and the occurrence probability of the second target object obtained through prediction, so as to further improve the expression capability of the target detection model for the second target object, improve the accuracy of the obtained conditional probability, and improve the accuracy of the trained target detection model. Specifically, the target detection model may be trained based on the first actual position, the first predicted position, the first occurrence probability, the actual conditional probability, the first conditional probability, the second actual position, and the second predicted position.
According to the embodiment of the disclosure, when the target detection model is trained, a loss function can be set for each output branch, so that the precision of each branch in the target detection model trained based on the loss function is improved.
In an embodiment, when the target detection model has branches that output a heat map for the first target object, branches that output regression results, and branches that output a conditional probability map, the method of training the target detection model may be implemented by the following flow. The value of the first regression loss sub-function in the predetermined loss function can be determined based on the first actual position, the first predicted position and the first occurrence probability to obtain a first value. And determining the value of a positioning loss sub-function in the preset loss function based on the first actual position and the first predicted position to obtain a second value. And determining the value of a second regression loss sub-function in the predetermined loss function based on the actual conditional probability and the first conditional probability to obtain a third value. And finally, training the target detection model based on the first value, the second value and the third value. For example, weights may be assigned to the three loss sub-functions in advance, and a weighted sum of the three values may be calculated based on the weights. And finally, training a target detection model by adopting a gradient descent algorithm or a back propagation algorithm and the like based on the weighted sum.
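The weighted combination of the three loss values described above may be sketched as follows (the weight values are illustrative assumptions; the disclosure only states that weights are assigned to the sub-functions in advance):

```python
def total_loss(l_heatmap, l_localization, l_cond_prob, weights=(1.0, 5.0, 1.0)):
    """Weighted sum of the first regression loss (heat-map branch), the
    positioning loss (box branch), and the second regression loss
    (conditional-probability branch); the weights are illustrative."""
    w1, w2, w3 = weights
    return w1 * l_heatmap + w2 * l_localization + w3 * l_cond_prob
```

The resulting scalar is what a gradient-descent or back-propagation step would minimize.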
Wherein the actual center point of the first target object in the sample image may be determined based on the first actual position, the actual occurrence probability at the actual center point is set to 1, and the actual occurrence probability at non-center positions is set to 0. Meanwhile, the probability at the center position in the first predicted position is determined as the first occurrence probability, and the first occurrence probabilities at other positions are represented by the probabilities indicated by the points in the heat map other than the peak point. The value of the first regression loss sub-function may be determined based on the difference between the actual occurrence probability and the predicted occurrence probability of each pixel point. The first regression loss sub-function may be represented by, for example, a modified focal loss function (Modified Focal Loss), or by a mean square error loss function, etc., which is not limited by the present disclosure.
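A minimal NumPy sketch of such a modified focal loss over a heat map follows; the exponents `alpha` and `beta` follow common practice for this loss family and are assumptions here, not values stated in the disclosure:

```python
import numpy as np

def modified_focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-6):
    """Pixel-wise modified focal loss between a predicted heat map `pred`
    and a ground-truth heat map `gt` (1 exactly at actual centers).

    Positive pixels are down-weighted when already confident; negative
    pixels near a center (high `gt`) are penalized less via (1-gt)^beta.
    """
    pos = gt == 1.0
    neg = ~pos
    pred = np.clip(pred, eps, 1.0 - eps)  # guard the logarithms
    pos_loss = -((1.0 - pred[pos]) ** alpha) * np.log(pred[pos])
    neg_loss = -((1.0 - gt[neg]) ** beta) * (pred[neg] ** alpha) * np.log(1.0 - pred[neg])
    num_pos = max(int(pos.sum()), 1)
    return float((pos_loss.sum() + neg_loss.sum()) / num_pos)
```

A confident, well-placed prediction yields a much smaller loss than a uniform one, which is the behavior the training step relies on.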
Wherein an area intersection-over-union ratio of the actual bounding box and the predicted bounding box of the first target object may be determined based on the first actual position and the first predicted position. The value of the positioning loss sub-function is then calculated based on this intersection-over-union ratio. For example, different sampling weights may be assigned to bounding boxes with different area intersection-over-union ratios, thereby balancing the contribution of the detection result of each target object to the loss. It is to be understood that this method of calculating the value of the positioning loss sub-function is only used as an example to facilitate understanding of the present disclosure, and the present disclosure is not limited thereto.
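An IoU-based positioning loss may be sketched as follows, assuming axis-aligned `(x1, y1, x2, y2)` boxes; the per-sample weight parameter is the illustrative counterpart of the sampling weights mentioned above:

```python
def iou(box_a, box_b):
    """Area intersection-over-union of two axis-aligned (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union

def positioning_loss(pred_box, gt_box, sample_weight=1.0):
    """IoU-based positioning loss: 0 for a perfect box, approaching
    `sample_weight` as the overlap vanishes."""
    return sample_weight * (1.0 - iou(pred_box, gt_box))
```

Using `1 - IoU` keeps the loss scale-invariant with respect to box size, which is one reason IoU-family losses are preferred over raw coordinate regression.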
The mean square error between the actual conditional probability and the first conditional probability may be used as the value of the second regression loss sub-function. It is to be understood that the second regression loss sub-function may be the same as or different from the first regression loss sub-function, and the second regression loss sub-function may also be a focal loss function, etc., which is not limited in this disclosure.
In an embodiment, after obtaining the values of the three loss functions, for example, a feature extraction network and a coding network shared by three branches in the target detection model may be trained based on a weighted sum of the three loss functions. Subsequently, values of three sub-loss functions related to the outputs of the three branches are adopted to train the decoding networks in the three branches respectively.
In an embodiment, when the target detection network detects the predicted position and the occurrence probability of the second target object at the same time, the value of the first regression loss sub-function may be determined by comprehensively considering the difference between the occurrence probability of the first target object and the actual occurrence probability and the difference between the occurrence probability of the second target object and the actual occurrence probability. Similarly, the value of the positioning loss sub-function can be determined by comprehensively considering the difference between the actual position and the predicted position of the first target object and the difference between the actual position and the predicted position of the second target object. The present disclosure does not limit this particular implementation.
Based on the model obtained by the training method of the target detection model described above, the present disclosure also provides a method for detecting a target object by using the trained model, which will be described in detail below with reference to fig. 4.
Fig. 4 is a flowchart illustrating a method for detecting a target object using a target detection model according to an embodiment of the disclosure.
As shown in fig. 4, the method 400 of detecting a target object using a target detection model of this embodiment may include operations S410 to S440. The target detection model is obtained by training the training method of the target detection model described above. That is, the object detection model includes a feature extraction network, an object detection network, and a conditional random field network.
In operation S410, the image to be detected is input to the feature extraction network to obtain second feature data. The method for obtaining the second characteristic data is similar to the method for obtaining the first characteristic data described above, and is not described herein again.
In operation S420, the second feature data is input into the target detection network, and a third predicted position of the first target object and a third occurrence probability of the first target object with respect to the third predicted position are obtained. The method for obtaining the third predicted position and the third occurrence probability is similar to the method for obtaining the first predicted position and the first occurrence probability described above, and is not described herein again.
In operation S430, the second feature data is input into the conditional random field network, and a second conditional probability of the first target object with respect to the second target object is obtained. The second target object comprises a first target object, and the first target object can move relative to the center of the second target object. The method for obtaining the second conditional probability is similar to the method for obtaining the first conditional probability described above, and is not described herein again.
In operation S440, a first target object included in the image to be detected is determined based on the third occurrence probability and the second conditional probability.
According to an embodiment of the present disclosure, a target occurrence probability of the first target object may be determined based on the third occurrence probability and the second conditional probability. Then, among the plurality of first target objects predicted for the image to be detected, the first target object whose target occurrence probability is higher than a preset threshold is determined as the first target object included in the image to be detected.
For example, the average of the third occurrence probability and the second conditional probability may be used as the target occurrence probability of the first target object. Alternatively, the square root of the product of the third occurrence probability and the second conditional probability may be used as the target occurrence probability. The method for obtaining the target occurrence probability is not limited in the present disclosure, as long as the target occurrence probability is positively correlated with both the third occurrence probability and the second conditional probability. For example, the target occurrence probability may also be the product of the third occurrence probability and the second conditional probability. In the gesture recognition scenario, assume the center of the human body is C_b and the center of the human hand is C_h. The predicted conditional probability of the human hand center with respect to the human body center may then be expressed as P(C_h | C_b) (i.e., the second conditional probability). Taking into account the directly predicted probability of the human hand center, P(C_h) (i.e., the third occurrence probability), the target occurrence probability of the human hand, P_target(C_h), is:

P_target(C_h) = P(C_h | C_b) · P(C_h)
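The product combination and the subsequent threshold-based screening of detected hands may be sketched as follows (the threshold value and the candidate representation are illustrative assumptions):

```python
def target_occurrence_probability(p_cond, p_direct):
    """Product of the conditional probability P(C_h | C_b) and the
    directly predicted probability P(C_h); one positively-correlated
    combination among those the disclosure permits."""
    return p_cond * p_direct

def filter_hands(candidates, threshold=0.5):
    """Keep only detected hands whose combined probability exceeds the
    preset threshold; the threshold value here is an assumption."""
    return [c for c in candidates
            if target_occurrence_probability(c["p_cond"], c["p_direct"]) > threshold]

hands = [{"p_cond": 0.9, "p_direct": 0.8},   # hand bound to the detected body
         {"p_cond": 0.1, "p_direct": 0.9}]   # hand not bound to that body
kept = filter_hands(hands)
```

Both factors must be high for a hand to survive the screening, which is how a confidently detected hand belonging to a different person is removed.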
According to the method of the embodiments of the present disclosure, the occurrence probability of the first target object for the second target object is obtained by using the target detection model, and the predicted first target objects are screened according to this occurrence probability to obtain the first target object included in the image to be detected, so that the accuracy of detecting the first target object can be improved. In a gesture recognition scene, hands that do not belong to the instruction issuer can be conveniently removed, and the gesture recognition accuracy is improved.
Based on the training method of the target detection model, the disclosure also provides a training device of the target detection model. The apparatus will be described in detail below with reference to fig. 5.
Fig. 5 is a block diagram of a structure of a training apparatus for an object detection model according to an embodiment of the present disclosure.
As shown in fig. 5, the training apparatus 500 of the object detection model of this embodiment includes a first feature data obtaining module 510, a first object detection module 520, a first probability obtaining module 530, and a model training module 540. The target detection model comprises a feature extraction network, a target detection network and a conditional random field network.
The first feature data obtaining module 510 is configured to input the sample image into a feature extraction network to obtain first feature data. Wherein the sample image has a label indicating a first actual position of the first target object and an actual conditional probability of the first target object for the second target object in the sample image. In an embodiment, the first feature data obtaining module 510 may be configured to perform the operation S210 described above, which is not described herein again.
The first target detection module 520 is configured to input the first feature data into the target detection network, so as to obtain a first predicted position of the first target object and a first occurrence probability of the first target object for the first predicted position. In an embodiment, the first target detecting module 520 may be configured to perform the operation S220 described above, which is not described herein again.
The first probability obtaining module 530 is configured to input the first feature data into the conditional random field network to obtain a first conditional probability of the first target object with respect to the second target object. The second target object comprises a first target object, and the first target object can move relative to the center of the second target object. In an embodiment, the first probability obtaining module 530 may be configured to perform the operation S230 described above, and is not described herein again.
The model training module 540 is configured to train the target detection model based on the first actual position, the first predicted position, the first occurrence probability, the actual conditional probability, and the first conditional probability. In an embodiment, the model training module 540 may be configured to perform the operation S240 described above, which is not described herein again.
According to an embodiment of the present disclosure, the first object detection module 520 may include a heat map obtaining sub-module and a center determination sub-module. The heat map acquisition sub-module is configured to input the first feature data into the target detection network to obtain a heat map for the first target object, each point in the heat map indicating a probability that a center of the first target object is located at each point. The center determination sub-module is configured to determine a predicted center position of the first target object based on the peak point in the heat map and determine a probability indicated by the peak point as a first probability of occurrence.
According to an embodiment of the present disclosure, the target detection network may include a central positioning unit and a size regression unit. The first target detection module 520 may further include a size obtaining sub-module, configured to input the first feature data into the size regression unit to obtain a predicted height and a predicted width of the first target object. Wherein the heat map obtaining sub-module is configured to enter the first feature data into the central positioning unit to obtain the heat map.
According to an embodiment of the present disclosure, the first target detection module 520 is further configured to input the image feature data into a target detection network, and obtain a second predicted position of the second target object and a second occurrence probability of the second target object for the second predicted position. Wherein the tag further indicates a second actual position of the second target object. The model training module 540 is configured to train the target detection model based on the first actual position, the first predicted position, the first occurrence probability, the actual conditional probability, the first conditional probability, the second actual position, and the second predicted position.
According to an embodiment of the present disclosure, the model training module 540 may include a first value obtaining submodule, a second value obtaining submodule, a third value obtaining submodule, and a training submodule. The first value obtaining submodule is used for determining the value of the first regression loss sub-function in the predetermined loss function based on the first actual position, the first predicted position, and the first occurrence probability to obtain a first value. The second value obtaining submodule is used for determining the value of the positioning loss sub-function in the predetermined loss function based on the first actual position and the first predicted position to obtain a second value. The third value obtaining submodule is used for determining the value of the second regression loss sub-function in the predetermined loss function based on the actual conditional probability and the first conditional probability to obtain a third value. The training submodule is used for training the target detection model based on the first value, the second value, and the third value.
According to an embodiment of the present disclosure, the target detection model includes a Training-Time-Friendly Network (TTFNET) model.
Based on the method for detecting the target object by adopting the target detection model described above, the present disclosure also provides a device for detecting the target object by adopting the target detection model. The apparatus will be described in detail below with reference to fig. 6.
Fig. 6 is a block diagram of an apparatus for detecting a target object using a target detection model according to an embodiment of the present disclosure.
As shown in fig. 6, the apparatus 600 for detecting a target object using a target detection model may include a second feature data obtaining module 610, a second target detection module 620, a second probability obtaining module 630, and an object determination module 640. The target detection model comprises a feature extraction network, a target detection network and a conditional random field network. The target detection model may be obtained by training with a training apparatus using the target detection model described above, for example.
The second feature data obtaining module 610 is configured to input the image to be detected into the feature extraction network to obtain second feature data. In an embodiment, the second feature data obtaining module 610 may be configured to perform the operation S410 described above, which is not described herein again.
The second target detection module 620 is configured to input the second feature data into the target detection network, so as to obtain a third predicted position of the first target object and a third occurrence probability of the first target object for the third predicted position. In an embodiment, the second target detecting module 620 may be configured to perform the operation S420 described above, which is not described herein again.
The second probability obtaining module 630 is configured to input the second feature data into the conditional random field network to obtain a second conditional probability of the first target object for the second target object. Wherein the second target object includes the first target object, and the first target object is movable relative to the center of the second target object. In an embodiment, the second probability obtaining module 630 may be configured to perform the operation S430 described above, which is not described herein again.
The object determining module 640 is configured to determine a first target object included in the image to be detected based on the third occurrence probability and the second conditional probability. In an embodiment, the object determining module 640 may be configured to perform the operation S440 described above, which is not described herein again.
According to an embodiment of the present disclosure, the object determination module 640 may include a probability determination sub-module and an object determination sub-module. The probability determination submodule is used for determining the target occurrence probability of the first target object based on the third occurrence probability and the second conditional probability. The object determination submodule is used for determining that the first target object with the target occurrence probability higher than the preset threshold is the first target object included in the image to be detected.
It should be noted that, in the technical solution of the present disclosure, the acquisition, storage, application, and the like of the personal information of the related user all conform to the regulations of the relevant laws and regulations, and do not violate the common customs of the public order.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 7 illustrates a schematic block diagram of an example electronic device 700 that may be used to implement the training method of the target detection model and/or the method of detecting a target object using the target detection model of the embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the device 700 comprises a computing unit 701, which may perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM)702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
Computing unit 701 may be a variety of general purpose and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The calculation unit 701 performs the above-described respective methods and processes, such as a training method of the target detection model and/or a method of detecting the target object using the target detection model. For example, in some embodiments, the training method of the target detection model and/or the method of detecting the target object using the target detection model may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of a computer program may be loaded onto and/or installed onto device 700 via ROM 702 and/or communications unit 709. When loaded into RAM 703 and executed by the computing unit 701, may perform one or more steps of the method of training the object detection model and/or the method of detecting an object using the object detection model described above. Alternatively, in other embodiments, the computing unit 701 may be configured by any other suitable means (e.g. by means of firmware) to perform a training method of the target detection model and/or a method of detecting a target object using the target detection model.
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system that overcomes the drawbacks of difficult management and weak service scalability found in traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system, or a server incorporating a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present disclosure shall fall within its scope of protection.

Claims (19)

1. A training method of a target detection model, wherein the target detection model comprises a feature extraction network, a target detection network and a conditional random field network; the method comprises the following steps:
inputting a sample image into the feature extraction network to obtain first feature data, wherein the sample image has a label indicating a first actual position of a first target object in the sample image and an actual conditional probability of the first target object for a second target object;
inputting the first feature data into the target detection network to obtain a first predicted position of a first target object and a first occurrence probability of the first target object for the first predicted position;
inputting the first feature data into the conditional random field network to obtain a first conditional probability of the first target object for the second target object; and
training the target detection model based on the first actual position, the first predicted position, the first probability of occurrence, the actual conditional probability, and the first conditional probability,
wherein the second target object includes the first target object, and the first target object is movable relative to a center of the second target object.
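The data flow of claim 1 can be sketched as below. This is a minimal illustration, not the patent's implementation: `feature_net`, `detection_net` and `crf_net` are hypothetical stand-ins for the feature extraction network, target detection network and conditional random field network, and their toy return values are assumptions.

```python
import numpy as np

def forward_pass(sample_image, feature_net, detection_net, crf_net):
    """Run the three sub-networks of claim 1 in sequence (hypothetical interfaces)."""
    features = feature_net(sample_image)                      # first feature data
    predicted_pos, occurrence_prob = detection_net(features)  # first predicted position and first occurrence probability
    conditional_prob = crf_net(features)                      # first conditional probability of the first target object for the second
    return predicted_pos, occurrence_prob, conditional_prob

# Toy stand-ins so the sketch runs end to end; a real model would be trained networks.
feature_net = lambda img: img.mean(axis=-1)
detection_net = lambda f: ((3, 4), 0.9)
crf_net = lambda f: 0.8

pos, p_occ, p_cond = forward_pass(np.zeros((8, 8, 3)), feature_net, detection_net, crf_net)
```

A training step would then compare `pos`, `p_occ` and `p_cond` against the label's first actual position and actual conditional probability.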
2. The method of claim 1, wherein obtaining a first predicted position of a first target object and a first occurrence probability of the first target object for the first predicted position comprises:
inputting the first feature data into the target detection network to obtain a heat map for the first target object, each point in the heat map indicating a probability that a center of the first target object is located at that point; and
determining a predicted center position of the first target object based on a peak point in the heat map, and determining a probability indicated by the peak point as the first probability of occurrence.
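The peak-point reading of claim 2 can be illustrated as follows: each heat-map value is treated as the probability that the first target object's center lies at that point, and the peak yields both the predicted center position and the first occurrence probability. A minimal sketch, not the patented network:

```python
import numpy as np

def peak_center(heatmap):
    """Return the predicted center (row, col) and the probability at the peak point."""
    idx = np.unravel_index(np.argmax(heatmap), heatmap.shape)  # location of the maximum
    return idx, float(heatmap[idx])

# Toy heat map with a single peak at (2, 3) of probability 0.95.
heatmap = np.zeros((5, 5))
heatmap[2, 3] = 0.95
center, p_occ = peak_center(heatmap)
```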
3. The method of claim 2, wherein the target detection network comprises a central positioning unit and a size regression unit; obtaining a first predicted position of a first target object and a first probability of occurrence of the first target object for the first predicted position further comprises:
inputting the first feature data into the size regression unit to obtain a predicted height and a predicted width of the first target object,
wherein the heat map is obtained by inputting the first feature data into the central positioning unit.
4. The method of claim 1, further comprising:
inputting the image feature data into the target detection network to obtain a second predicted position of the second target object and a second occurrence probability of the second target object for the second predicted position;
wherein the tag further indicates a second actual location of the second target object; training the target detection model comprises:
training the target detection model based on the first actual position, the first predicted position, the first probability of occurrence, the actual conditional probability, the first conditional probability, the second actual position, and the second predicted position.
5. The method of claim 1, wherein training the target detection model comprises:
determining a value of a first regression loss sub-function in a predetermined loss function based on the first actual position, the first predicted position and the first occurrence probability to obtain a first value;
determining a value of a positioning loss sub-function in the predetermined loss function based on the first actual position and the first predicted position to obtain a second value;
determining a value of a second regression loss sub-function in the predetermined loss function based on the actual conditional probability and the first conditional probability to obtain a third value; and
training the target detection model based on the first value, the second value and the third value.
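The three-part loss of claim 5 can be sketched as a single scalar. The claim does not specify the sub-functions, so the choices below are pure assumptions for illustration: squared position error weighted by the occurrence probability for the first regression loss, plain squared position error for the positioning loss, and squared probability error for the second regression loss.

```python
def total_loss(first_actual, first_pred, p_occ, actual_cond, pred_cond):
    """Combine three hypothetical sub-losses into one training objective."""
    # First value: first regression loss, position error weighted by the first occurrence probability.
    first_value = p_occ * sum((a - p) ** 2 for a, p in zip(first_actual, first_pred))
    # Second value: positioning loss, unweighted position error.
    second_value = sum((a - p) ** 2 for a, p in zip(first_actual, first_pred))
    # Third value: second regression loss, error between actual and predicted conditional probability.
    third_value = (actual_cond - pred_cond) ** 2
    return first_value + second_value + third_value

loss = total_loss((2.0, 3.0), (2.5, 3.0), 0.9, 0.8, 0.6)
```

In practice the three values would be weighted and minimized by gradient descent over the whole model.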
6. The method of any of claims 1-5, wherein the target detection model comprises a training-time-friendly network model.
7. A method for detecting a target object using a target detection model, wherein the target detection model comprises a feature extraction network, a target detection network and a conditional random field network; the method comprises the following steps:
inputting an image to be detected into the feature extraction network to obtain second feature data;
inputting the second feature data into the target detection network to obtain a third predicted position of the first target object and a third occurrence probability of the first target object for the third predicted position;
inputting the second feature data into the conditional random field network to obtain a second conditional probability of the first target object for a second target object; and
determining a first target object included in the image to be detected based on the third probability of occurrence and the second conditional probability,
wherein the second target object includes the first target object, and the first target object is movable relative to a center of the second target object; the target detection model is trained using the method of any one of claims 1-6.
8. The method of claim 7, wherein determining the first target object comprised by the image to be detected comprises:
determining a target probability of occurrence for the first target object based on the third probability of occurrence and the second conditional probability; and
determining, as the first target object included in the image to be detected, the first target object whose target occurrence probability is higher than a preset threshold.
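The filtering of claim 8 can be sketched as below. The claim only says the target occurrence probability is determined "based on" the third occurrence probability and the second conditional probability; combining them by multiplication is an assumption made here for illustration.

```python
def detect(candidates, threshold=0.5):
    """Keep candidates whose combined target probability exceeds the preset threshold."""
    detections = []
    for pos, p_occ, p_cond in candidates:
        target_prob = p_occ * p_cond  # assumed combination rule (product)
        if target_prob > threshold:
            detections.append((pos, target_prob))
    return detections

# First candidate: 0.9 * 0.8 = 0.72 > 0.5, kept; second: 0.6 * 0.5 = 0.30, dropped.
found = detect([((2, 3), 0.9, 0.8), ((5, 5), 0.6, 0.5)], threshold=0.5)
```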
9. A training apparatus of a target detection model, wherein the target detection model comprises a feature extraction network, a target detection network and a conditional random field network; the apparatus comprises:
a first feature data obtaining module, configured to input a sample image into the feature extraction network, so as to obtain first feature data, where the sample image has a label indicating a first actual position of a first target object in the sample image and an actual conditional probability of the first target object for a second target object;
a first target detection module, configured to input the first feature data into the target detection network to obtain a first predicted position of a first target object and a first occurrence probability of the first target object for the first predicted position;
a first probability obtaining module for inputting the first feature data into the conditional random field network to obtain a first conditional probability of the first target object with respect to the second target object; and
a model training module to train the target detection model based on the first actual position, the first predicted position, the first probability of occurrence, the actual conditional probability, and the first conditional probability,
wherein the second target object includes the first target object, and the first target object is movable relative to a center of the second target object.
10. The apparatus of claim 9, wherein the first target detection module comprises:
a heat map obtaining sub-module configured to input the first feature data into the target detection network, to obtain a heat map for the first target object, where each point in the heat map indicates a probability that a center of the first target object is located at that point; and
a center determination sub-module configured to determine a predicted center position of the first target object based on a peak point in the heat map, and determine a probability indicated by the peak point as the first probability of occurrence.
11. The apparatus of claim 10, wherein the target detection network comprises a central positioning unit and a size regression unit; the first target detection module further comprises:
a size obtaining submodule for inputting the first feature data into the size regression unit to obtain a predicted height and a predicted width of the first target object,
wherein the heat map obtaining sub-module is configured to input the first feature data into the central positioning unit to obtain the heat map.
12. The apparatus of claim 9, wherein the first target detection module is further configured to:
inputting the image feature data into the target detection network to obtain a second predicted position of the second target object and a second occurrence probability of the second target object for the second predicted position,
wherein the tag further indicates a second actual location of the second target object; the model training module is configured to train the target detection model based on the first actual position, the first predicted position, the first occurrence probability, the actual conditional probability, the first conditional probability, the second actual position, and the second predicted position.
13. The apparatus of claim 9, wherein the model training module comprises:
a first value obtaining sub-module, configured to determine a value of a first regression loss sub-function in a predetermined loss function based on the first actual position, the first predicted position, and the first occurrence probability, to obtain a first value;
a second value obtaining sub-module, configured to determine a value of a positioning loss sub-function in the predetermined loss function based on the first actual position and the first predicted position, to obtain a second value;
a third value obtaining submodule, configured to determine a value of a second regression loss sub-function in the predetermined loss function based on the actual conditional probability and the first conditional probability, and obtain a third value; and
a training sub-module, configured to train the target detection model based on the first value, the second value and the third value.
14. The apparatus of any of claims 9-13, wherein the target detection model comprises a training-time-friendly network model.
15. An apparatus for detecting a target object using a target detection model, wherein the target detection model comprises a feature extraction network, a target detection network, and a conditional random field network; the apparatus comprises:
the second characteristic data acquisition module is used for inputting the image to be detected into the characteristic extraction network to obtain second characteristic data;
the second target detection module is used for inputting the second feature data into the target detection network to obtain a third predicted position of the first target object and a third occurrence probability of the first target object aiming at the third predicted position;
a second probability obtaining module, configured to input the second feature data into the conditional random field network, so as to obtain a second conditional probability of the first target object for a second target object; and
an object determination module, configured to determine a first target object included in the image to be detected based on the third occurrence probability and the second conditional probability,
wherein the second target object includes the first target object, and the first target object is movable relative to a center of the second target object; the target detection model is trained using the apparatus of any one of claims 9-14.
16. The apparatus of claim 15, wherein the object determination module comprises:
a probability determination submodule for determining a target occurrence probability of the first target object based on the third occurrence probability and the second conditional probability; and
an object determination sub-module, configured to determine, as the first target object included in the image to be detected, the first target object whose target occurrence probability is higher than a preset threshold.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any of claims 1-8.
19. A computer program product comprising a computer program which, when executed by a processor, implements a method according to any one of claims 1 to 8.
CN202110878323.0A 2021-07-30 2021-07-30 Training method of target detection model and method and device for detecting target object Pending CN113627298A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110878323.0A CN113627298A (en) 2021-07-30 2021-07-30 Training method of target detection model and method and device for detecting target object


Publications (1)

Publication Number Publication Date
CN113627298A 2021-11-09

Family

ID=78382054

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110878323.0A Pending CN113627298A (en) 2021-07-30 2021-07-30 Training method of target detection model and method and device for detecting target object

Country Status (1)

Country Link
CN (1) CN113627298A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115061113A (en) * 2022-08-19 2022-09-16 南京隼眼电子科技有限公司 Target detection model training method and device for radar and storage medium
WO2023236044A1 (en) * 2022-06-07 2023-12-14 西门子股份公司 Object detection model training method and apparatus, and computer device and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109934147A (en) * 2019-03-05 2019-06-25 北京联合大学 Object detection method, system and device based on deep neural network
WO2020155518A1 (en) * 2019-02-03 2020-08-06 平安科技(深圳)有限公司 Object detection method and device, computer device and storage medium
CN113065614A (en) * 2021-06-01 2021-07-02 北京百度网讯科技有限公司 Training method of classification model and method for classifying target object

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
GE Mingjin; SUN Zuolei; KONG Wei: "Anchor-free object detection technology for traffic scenes", Computer Engineering and Science, no. 04 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination