CN115861400A - Target object detection method, training method and device and electronic equipment - Google Patents


Info

Publication number
CN115861400A
Authority
CN
China
Prior art keywords
target object
sample
depth
prediction
image
Prior art date
Legal status
Granted
Application number
CN202310113169.7A
Other languages
Chinese (zh)
Other versions
CN115861400B (en)
Inventor
Zou Zhikang (邹智康)
Ye Xiaoqing (叶晓青)
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202310113169.7A
Publication of CN115861400A
Application granted
Publication of CN115861400B
Legal status: Active


Abstract

The application provides a target object detection method, a training method and apparatus, and an electronic device, relates to artificial intelligence technical fields such as computer vision, image processing and deep learning, and can be applied to scenarios such as automatic driving and smart cities. The specific implementation scheme is as follows: performing depth information prediction on a target object in an image to be detected to obtain an initial key point depth of a key point of the target object and a depth information confidence corresponding to the initial key point depth; determining a target prediction depth of the target object according to the depth information confidence and the initial key point depth; and detecting the target object in the image to be detected according to the target prediction depth.

Description

Target object detection method, training method and device and electronic equipment
Technical Field
The application relates to the technical field of artificial intelligence such as computer vision, image processing and deep learning, and can be applied to automatic driving, smart cities and other scenes.
Background
In application scenes such as automatic driving and intelligent traffic, images of a space to be detected can be collected, target objects such as vehicles and traffic signboards in the space can be detected according to the collected images, and intelligent functions such as automatic driving of the vehicles and identification of abnormal traffic conditions can be achieved according to detection results of the target objects.
Disclosure of Invention
The application provides a target object detection method, a training method and apparatus for a deep learning model, an electronic device, a storage medium and a program product.
According to an aspect of the present application, there is provided a target object detection method, including: performing depth information prediction on a target object in an image to be detected to obtain an initial key point depth of a key point of the target object and a depth information confidence corresponding to the initial key point depth; determining a target prediction depth of the target object according to the depth information confidence and the initial key point depth; and detecting the target object in the image to be detected according to the target prediction depth.
According to another aspect of the present application, there is provided a training method of a deep learning model, including: inputting a sample image to be detected into an initial deep learning model, and outputting a sample initial key point depth of a sample key point of a sample target object in the sample image and a sample depth information confidence corresponding to the sample initial key point depth; determining a predicted three-dimensional detection frame of the sample target object according to the sample initial key point depth and a sample two-dimensional attribute of the sample target object; and training the initial deep learning model by utilizing a label three-dimensional detection frame corresponding to the sample target object, the predicted three-dimensional detection frame and the sample depth information confidence to obtain a trained deep learning model.
According to another aspect of the present application, there is provided a target object detecting apparatus including: the prediction module is used for predicting depth information of a target object in an image to be detected to obtain initial key point depth of key points of the target object and a depth information confidence coefficient corresponding to the initial key point depth; a target prediction depth determination module, configured to determine a target prediction depth of the target object according to the depth information confidence and the initial keypoint depth; and the detection module is used for detecting the target object in the image to be detected according to the target prediction depth.
According to another aspect of the present application, there is provided a training apparatus for deep learning models, including: the sample image processing module is used for inputting a sample image to be detected into the initial deep learning model and outputting the sample initial key point depth of the sample key points of the sample target object in the sample image and the sample depth information confidence corresponding to the sample initial key point depth; the predicted three-dimensional detection frame determining module is used for determining a predicted three-dimensional detection frame of the sample target object according to the sample initial key point depth and the sample two-dimensional attribute of the sample target object; and the training module is used for training the initial deep learning model by utilizing the label three-dimensional detection frame corresponding to the sample target object, the prediction three-dimensional detection frame and the sample depth information confidence coefficient to obtain a trained deep learning model.
According to another aspect of the present application, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described above.
According to another aspect of the present application, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method as described above.
According to another aspect of the application, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the method as described above.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present application, nor do they limit the scope of the present application. Other features of the present application will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application.
Fig. 1 schematically shows an exemplary system architecture to which the target object detection method and apparatus may be applied according to an embodiment of the present application.
Fig. 2 schematically shows a flow chart of a target object detection method according to an embodiment of the application.
Fig. 3 schematically shows a flowchart of depth information prediction of a target object in an image to be detected according to an embodiment of the present application.
Fig. 4 schematically shows a flow chart of a target object detection method according to another embodiment of the present application.
Fig. 5 schematically illustrates an application scene diagram for determining a plurality of key points of a target object according to a central point of a target object central point thermodynamic diagram according to an embodiment of the present application.
Fig. 6 schematically shows an application scenario diagram of a target object detection method according to an embodiment of the present application.
Fig. 7 schematically shows a flowchart of a training method of a deep learning model according to an embodiment of the present application.
Fig. 8 schematically shows a block diagram of a target object detection apparatus according to an embodiment of the present application.
Fig. 9 schematically shows a block diagram of a training apparatus for deep learning models according to an embodiment of the present application.
Fig. 10 schematically shows a block diagram of an electronic device adapted to implement a target object detection method or a training method of a deep learning model according to an embodiment of the present application.
Detailed Description
The following description of the exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments of the application for the understanding of the same, which are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the technical solution of the application, the acquisition, storage and application of the personal information of the users involved all comply with the provisions of relevant laws and regulations, necessary security measures have been taken, and public order and good morals are not violated. Accordingly, before the personal information of a user is acquired, the user is informed of the purpose for which the information is required, and the information is acquired only after the user's authorization is obtained.
The application provides a target object detection method, a training method and device of a deep learning model, electronic equipment, a storage medium and a program product.
According to an embodiment of the present application, a target object detection method includes: performing depth information prediction on a target object in an image to be detected to obtain an initial key point depth of a key point of the target object and a depth information confidence corresponding to the initial key point depth; determining a target prediction depth of the target object according to the depth information confidence and the initial key point depth; and detecting the target object in the image to be detected according to the target prediction depth.
According to the embodiment of the application, the initial key point depth and the depth information confidence of the key points of the target object can be obtained by detecting the image to be detected, and the target prediction depth representing the depth position of the target object is further determined according to the depth information confidence and the initial key point depth, so that the detection precision of the depth information of the target object in the image to be detected can be improved. The target object in the image to be detected is then detected according to the target prediction depth, which can at least achieve the technical effect of improving the detection precision of the target object.
Fig. 1 schematically shows an exemplary system architecture to which the target object detection method and apparatus may be applied according to an embodiment of the present application.
It should be noted that fig. 1 is only an example of a system architecture to which the embodiments of the present application may be applied to help those skilled in the art understand the technical content of the present application, and does not mean that the embodiments of the present application may not be applied to other devices, systems, environments or scenarios. For example, in another embodiment, an exemplary system architecture to which the target object detection method and apparatus may be applied may include a terminal device, and the terminal device may implement the target object detection method and apparatus provided in the embodiments of the present application without interacting with a server.
As shown in fig. 1, the system architecture 100 according to this embodiment may include terminal devices 101, 102, a vehicle 103, a network 104, and a server 105. The network 104 is used to provide a medium of communication links between the terminal devices 101, 102, the vehicle 103 and the server 105. Network 104 may include various connection types, such as wired and/or wireless communication links, and so forth.
The user may interact with the server 105 through the network 104 using the terminal devices 101, 102 to receive or send messages or the like, or the user may operate the vehicle 103 to interact with the server 105 through the network 104 to receive or send messages or the like. The terminal devices 101, 102, and the vehicle 103 may have various messaging client applications installed thereon, such as a knowledge reading application, a web browser application, a search application, an instant messaging tool, a mailbox client, and/or social platform software, etc. (by way of example only).
The terminal devices 101, 102, the vehicle 103 may be various electronic devices having display screens and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
It should be noted that an image acquisition device for acquiring an image to be detected may be installed on the vehicle 103, or the vehicle may acquire the image to be detected through a wireless communication link such as Bluetooth or a wireless network.
The server 105 may be a server that provides various services, such as a background management server (for example only) that provides support for content browsed by the user using the terminal devices 101, 102, the vehicle 103. The background management server may analyze and perform other processing on the received data such as the user request, and feed back a processing result (e.g., a webpage, information, or data obtained or generated according to the user request) to the terminal device.
It should be noted that the target object detection method provided in the embodiment of the present application may be generally executed by the terminal devices 101 and 102 or the vehicle 103. Accordingly, the target object detection apparatus provided in the embodiment of the present application may also be disposed in the terminal devices 101 and 102 or the vehicle 103.
Alternatively, the target object detection method provided in the embodiment of the present application may also be generally executed by the server 105. Accordingly, the target object detection apparatus provided in the embodiment of the present application may be generally disposed in the server 105. The target object detection method provided by the embodiment of the present application may also be executed by a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101 and 102, the vehicle 103, and/or the server 105. Accordingly, the target object detection apparatus provided in the embodiment of the present application may also be disposed in a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101 and 102, the vehicle 103, and/or the server 105.
It should be understood that the numbers of terminal devices, vehicles, networks and servers in fig. 1 are merely illustrative. There may be any number of terminal devices, vehicles, networks and servers, as desired for implementation.
Fig. 2 schematically shows a flow chart of a target object detection method according to an embodiment of the application.
As shown in FIG. 2, the method includes operations S210-S230.
In operation S210, depth information prediction is performed on a target object in an image to be detected, so as to obtain an initial keypoint depth of a keypoint of the target object and a depth information confidence corresponding to the initial keypoint depth.
In operation S220, a target predicted depth of the target object is determined according to the depth information confidence and the initial keypoint depth.
In operation S230, a target object in an image to be detected is detected according to a target prediction depth.
According to the embodiment of the application, the image to be detected can be obtained by performing image acquisition on a space to be detected through an image acquisition device such as a camera; for example, the space to be detected can be captured by a monocular camera to acquire the image to be detected. The target object in the image to be detected can include any type of target object in the space to be detected, such as a vehicle or a traffic signboard.
It should be noted that the number of the target objects in the image to be detected may be 1 or multiple, and the number of the target objects in the image to be detected is not limited in the embodiments of the present application.
According to an embodiment of the present application, the key point of the target object may include any type of point on the target object in the space to be detected, for example, a point on an edge of a two-dimensional detection frame representing the target object or any point within the two-dimensional detection frame. Accordingly, the number of the key points may be one or more, and the embodiment of the present application does not limit the type of the key points and/or the number of the key points, and a person skilled in the art may select the key points according to actual needs as long as the target object can be characterized.
According to the embodiment of the present application, the initial key point depth may include a distance between a key point and an image acquisition device that acquires an image to be detected, but is not limited thereto, and may also include a distance between a target object detection apparatus such as a vehicle on which the image acquisition device is mounted and the key point.
According to the embodiment of the application, the depth information confidence can represent the degree to which the corresponding initial key point depth contributes to predicting the depth position of the target object, so that the target prediction depth determined according to the depth information confidence and the initial key point depth can correspondingly improve the prediction accuracy of the depth information of the target object.
According to the embodiment of the application, detecting the target object in the image to be detected can mean detecting the position of the target object in the space to be detected according to the target prediction depth, or obtaining detection results such as the classification and moving speed of the target object according to the target prediction depth and other attribute information representing the target object.
According to the embodiment of the application, the initial key point depth and the depth information confidence of the key points of the target object can be obtained by detecting the image to be detected, and the target prediction depth representing the depth position of the target object is further determined according to the depth information confidence and the initial key point depth. This can at least partially solve the technical problem in the related art of low depth information accuracy caused by directly taking the depth of the center point of the target object as the depth information of the target object, and improves the detection precision of the depth information of the target object in the image to be detected. The target object in the image to be detected is then detected according to the target prediction depth, which can at least achieve the technical effect of improving the detection precision of the target object.
In any embodiment of the present application, the image to be detected may be obtained in various public and legal compliance manners, for example, the image obtained by acquiring the image of the space to be detected corresponding to the authorization information after obtaining the authorization of the relevant mechanism or the user, or the target object detection method in the embodiment of the present application may be executed by a mechanism or a user having a relevant image acquisition authority and an image analysis authority.
The method shown in fig. 2 is further described with reference to fig. 3 to 6 in conjunction with the specific embodiment.
According to an embodiment of the present application, the target object has a plurality of key points.
Fig. 3 schematically shows a flowchart of depth information prediction of a target object in an image to be detected according to an embodiment of the present application.
As shown in FIG. 3, the depth information prediction of the target object in the image to be detected in operation S210 may include operations S310 to S320.
In operation S310, an image to be detected is input to a semantic feature extraction layer of a target object detection model, and semantic features of the image are output, where the target object detection model further includes a key point depth prediction layer and a confidence prediction layer.
In operation S320, the image semantic features are respectively input to the keypoint depth prediction layer and the confidence prediction layer, and initial keypoint depths corresponding to the multiple keypoints of the target object and depth information confidences corresponding to the multiple initial keypoint depths are output.
According to the embodiment of the application, the semantic feature extraction layer may include a neural network layer constructed based on a neural network algorithm; for example, the semantic feature extraction layer may be constructed based on a Deep Layer Aggregation (DLA) algorithm, but is not limited thereto, and may also be constructed based on a residual network (ResNet) algorithm.
According to the embodiment of the application, the semantic feature extraction layer can be constructed based on a backbone network layer of a target detection algorithm to extract high-dimensional image semantic features from the image to be detected, thereby improving the accuracy of the subsequently output initial key point depth and/or depth information confidence.
According to the embodiment of the application, the image semantic features can comprise detection point thermodynamic diagrams, and the corresponding multiple key points can be determined by setting supervision identifications for the detection points in the detection point thermodynamic diagrams.
It should be noted that, in the embodiment of the present application, a specific manner for determining the plurality of key points is not limited, for example, the plurality of key points may be determined according to a manner of manually setting a supervision identifier, and a person skilled in the art may select a specific manner for determining the plurality of key points according to actual requirements.
According to the embodiment of the application, the key point depth prediction layer may be constructed based on a depth detection head (Depth Head) in the related art; for example, the key point depth prediction layer may be constructed based on fully connected layers.
According to the embodiment of the application, the confidence prediction layer can be constructed based on a neural network algorithm in the related art, and the depth information confidence output by the confidence prediction layer can have an association relationship with the corresponding initial key point depth, so that the target prediction depth can be conveniently determined according to the associated initial key point depth and depth information confidence.
For example, in the case where the number of key points is 3, the target prediction depth may be determined based on the following formula (1).
D = d₁·u₁ + d₂·u₂ + d₃·u₃ ; (1)
In formula (1), D may represent the target prediction depth, d₁, d₂ and d₃ represent the initial key point depths of the key points G1, G2 and G3, respectively, and u₁, u₂ and u₃ represent the depth information confidences corresponding to the initial key point depths d₁, d₂ and d₃, respectively.
According to the embodiment of the application, the depth information confidence serves as the weight parameter of the corresponding initial key point depth: multiplying the initial key point depth by the depth information confidence determines the target key point depth corresponding to the key point, so that the target key point depth can represent the contribution degree (confidence) of the key point to the output target prediction depth, and the target prediction depth of the target object can be determined based on the plurality of target key point depths. Compared with detection methods in the related art that detect the target object based on the depth information of the central point alone, this can reduce the error between the target prediction depth and the real depth of the target object, improve the detection accuracy of the depth information of the target object, and achieve the technical effect of improving the accuracy of the subsequent detection of the target object.
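As an illustration, the weighted combination of formula (1) can be sketched in a few lines of Python; the function name and the example depth and confidence values below are illustrative assumptions, not part of the patent:

```python
# Minimal sketch of formula (1): the target prediction depth as the
# confidence-weighted combination of the initial key point depths.
def target_prediction_depth(initial_depths, confidences):
    """D = sum_i d_i * u_i, with each u_i acting as the weight of d_i."""
    if len(initial_depths) != len(confidences):
        raise ValueError("each initial key point depth needs one confidence")
    # Each product d_i * u_i is the target key point depth of one key point.
    return sum(d * u for d, u in zip(initial_depths, confidences))

# Three key points G1, G2 and G3, as in the example above (values assumed).
d = [12.4, 12.9, 12.6]   # initial key point depths, e.g. in meters
u = [0.5, 0.2, 0.3]      # depth information confidences (weights)
print(target_prediction_depth(d, u))  # 12.56
```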
It should be noted that the target object detection model provided in the embodiment of the present application may be obtained after being trained by a related training method, and the embodiment of the present application does not limit a specific training method for training the target object detection model, for example, the target object detection model may be trained based on a gradient descent algorithm, but is not limited thereto, and the target object detection model may also be obtained by being trained based on other training methods.
In the embodiment of the present application, the target object detection model is not a detection model for a specific user, and is not used for detecting personal information of a specific user. The detection of the target object detection model can be executed after being authorized by a user, or the target object detection can be executed after being confirmed or authorized by an organization or a user with related detection authority, and the detection process conforms to related laws and regulations.
Fig. 4 schematically shows a flow chart of a target object detection method according to another embodiment of the present application.
As shown in FIG. 4, the target object detection method may further include operations S410-S420.
In operation S410, two-dimensional attributes of the target object are determined according to the image semantic features, wherein the two-dimensional attributes of the target object include a target object center point thermodynamic diagram.
In operation S420, a plurality of key points of the target object are determined according to a center point corresponding to the target object in the target object center point thermodynamic diagram.
According to an embodiment of the present application, a target object center point thermodynamic diagram (heatmap) may include a plurality of points for characterizing the target object, for example, pixel points with different colors. The color of each pixel point can represent the detection depth corresponding to that pixel point, and the plurality of pixel points can include the center point of the detected target object, namely the center point of the target object center point thermodynamic diagram. Other key points can then be screened from the target object center point thermodynamic diagram through the center point, thereby obtaining a plurality of key points containing the center point. For example, one or more pixel points closer to the center point in the target object center point thermodynamic diagram may be determined as the key points.
It should be understood that the obtained plurality of key points may also include respective location information of the plurality of key points, such as coordinates of the key points, and accordingly, the corresponding other key points may be determined by the location information of the central point.
According to the embodiment of the application, the plurality of key points of the target object are determined from the target object center point thermodynamic diagram according to the central point, so that the plurality of key points have attributes similar to those of the central point of the target object, and the depth position of the target object can be characterized from multiple dimensions through the initial key point depths and depth information confidences of the plurality of key points obtained by the target object detection method. Compared with characterizing the depth position of the target object only through the depth of the central point, comprehensively considering the initial key point depths and depth information confidences of a plurality of key points can correct errors caused by central point detection errors or inaccurate depth detection of the central point, thereby further improving the depth detection accuracy of the target object in the two-dimensional image to be detected and achieving the technical effect of improving the target object detection accuracy.
It should be noted that, an attribute detection head (also referred to as a detection branch) included in the target object detection model may process the image semantic features and then output a target object center point thermodynamic diagram, or may process the image semantic features through another detection model other than the target object detection model to obtain the target object center point thermodynamic diagram.
According to an embodiment of the application, the two-dimensional properties of the target object further comprise at least one of:
category attribute, orientation angle attribute, size attribute.
According to an embodiment of the present application, the category attribute may include a classification result for a target object in an image to be detected, and may include a classification result for a vehicle, a traffic signboard, and the like, for example.
According to the embodiment of the application, the size attribute may be used to represent the size of the target object in the image to be detected, and may include, for example, size information such as the length and the width of a detection frame corresponding to the target object.
According to an embodiment of the present application, in operation S420, determining a plurality of key points of the target object according to the central point corresponding to the target object in the target object central point thermodynamic diagram may include the following operations.
Screening out an adjacent key point corresponding to the central point from the thermodynamic diagram of the central point of the target object according to the central point and a static screening threshold value; and determining neighboring keypoints and the center point as a plurality of keypoints of the target object.
According to the embodiment of the application, the adjacent key points can be pixel points whose attributes are close to those of the central point in the target object center point thermodynamic diagram, for example, pixel points whose pixel colors are close to the pixel color of the central point. The static screening threshold can be a preset screening threshold; screening out the adjacent key points through the static screening threshold can improve the screening speed for the adjacent key points and improve the overall efficiency of target object detection.
According to the embodiment of the application, the static screening threshold value can be determined based on the color difference value between the central point and other pixel points in the thermodynamic diagram of the central point of the target object, or the static screening threshold value can be determined in other manners.
According to an embodiment of the application, the static screening threshold comprises at least one of:
a static distance screening threshold value and a static key point quantity screening threshold value.
According to the embodiment of the application, by setting the static distance screening threshold, the pixel points which are in the static distance screening threshold range from the central point in the thermodynamic diagram of the central point of the target object can be determined as the adjacent key points, so that the key points close to the central point can be screened out, and the technical problem of low target object detection precision caused by the depth position detection error of the central point is at least partially solved.
According to the embodiment of the application, the static key point quantity screening threshold value can represent the quantity of the adjacent key points, the adjacent key points are searched from the central point through the static key point quantity screening threshold value until the quantity of the adjacent key points reaches the static key point quantity screening threshold value, so that the quantity of the key points can be limited by setting the quantity of the static key point quantity screening threshold value, and the technical problem of low target object detection efficiency caused by the fact that the number of the determined key points is large is at least partially solved.
According to the embodiment of the application, the static screening threshold value can be determined by combining the static distance screening threshold value and the static key point quantity screening threshold value, namely, a plurality of key points with the quantity matched with the static key point quantity screening threshold value are screened out in the range of the central point away from the static distance screening threshold value, so that the efficiency of screening to obtain the key points is further improved, and the detection efficiency of subsequent target objects is improved.
Fig. 5 schematically illustrates an application scene diagram for determining a plurality of key points of a target object according to a central point of a target object central point thermodynamic diagram according to an embodiment of the present application.
As shown in fig. 5, the application scenario may include a target object center point thermodynamic diagram 500, where the target object center point thermodynamic diagram 500 may include a center point 510 corresponding to a target object, and neighboring key points 521, 522, 523, 524, 525, 526, 527, and 528 may be determined from pixel points adjacent to the center point 510 by using a static distance filtering threshold and a static key point quantity filtering threshold. Therefore, other key points adjacent to the central point 510 can be quickly and accurately screened out, and it can be determined that the plurality of key points corresponding to the target object may include the central point 510 and adjacent key points 521, 522, 523, 524, 525, 526, 527 and 528.
According to the embodiment of the application, the pixel point adjacent to the central point in the thermodynamic diagram of the central point of the target object can be determined as the adjacent key point, so that the speed of determining the key point is further increased, and the detection efficiency of subsequent target object detection is improved.
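For illustration, a minimal sketch of this static screening step is given below; the heatmap size, the threshold values and the use of the heatmap response to rank candidate neighbors are assumptions, chosen so that the center point plus 8 neighbors of fig. 5 are recovered:

```python
# Sketch: screen neighboring key points around the detected center point using
# a static distance screening threshold and a static key point quantity
# screening threshold (center + neighbors = max_keypoints).
import numpy as np

def select_keypoints(heatmap, center, max_dist=2.0, max_keypoints=9):
    cy, cx = center
    ys, xs = np.mgrid[0:heatmap.shape[0], 0:heatmap.shape[1]]
    dist = np.sqrt((ys - cy) ** 2 + (xs - cx) ** 2)
    # Candidates: within the static distance threshold, excluding the center.
    mask = (dist > 0) & (dist <= max_dist)
    candidates = np.argwhere(mask)
    # Rank candidates by heatmap response, keep the top (max_keypoints - 1).
    order = np.argsort(-heatmap[mask])[: max_keypoints - 1]
    return [tuple(center)] + [tuple(p) for p in candidates[order]]

heatmap = np.random.rand(96, 96)            # stand-in center point heatmap
keypoints = select_keypoints(heatmap, center=(40, 52))
print(len(keypoints))  # 9: the center point plus 8 neighboring key points
```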
According to an embodiment of the present application, in operation S420, determining a plurality of key points of the target object according to the central point corresponding to the target object in the target object central point thermodynamic diagram may include the following operations.
Determining a dynamic screening threshold value corresponding to the target object according to the two-dimensional attribute of the target object; and determining a plurality of key points of the target object according to the central point and the dynamic screening threshold.
According to an embodiment of the present application, the dynamic filtering threshold may include any one or more of a dynamic distance filtering threshold, a dynamic key point number filtering threshold. The dynamic screening threshold may be a screening threshold corresponding to the two-dimensional attribute, and the dynamic screening threshold may be dynamically changed according to different attribute information or attribute values of the two-dimensional attribute of the target object, so as to determine the dynamic screening threshold adapted to the target object, and improve screening accuracy for the neighboring key points.
For example, in the case where the two-dimensional attribute is a dimension attribute, the adapted dynamic quantity screening threshold may be determined according to a dimension size characterized by the dimension attribute. That is, it may be determined that the number represented by the dynamic number screening threshold is large when the area of the detection frame represented by the size attribute is small, and the number represented by the dynamic number screening threshold is small when the area of the detection frame represented by the size attribute is large. Therefore, the key points with a large number can be determined at least aiming at the target object of the long shot in the image to be detected, and the key points with a small number are determined aiming at the target object of the foreground in the image to be detected, so that the depth detection accuracy of the target object is ensured, the detection efficiency and the self-adaptive capacity of detecting the target object at different depth positions are improved, and the detection efficiency and accuracy of the target object are improved.
According to the embodiment of the application, the dynamic screening threshold value can be determined according to any one or more two-dimensional attributes such as the size attribute and the classification attribute of the target object as long as the actual requirement can be met.
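For instance, a dynamic key point quantity screening threshold could be derived from the size attribute along the following lines; the area break-points and key point counts are purely illustrative assumptions:

```python
# Sketch: fewer key points for large (foreground) detection boxes, more for
# small (long-shot) ones, as described above.
def dynamic_keypoint_count(box_width: float, box_height: float) -> int:
    area = box_width * box_height   # size attribute of the 2D detection box
    if area < 32 * 32:              # small box: distant target object
        return 9
    if area < 96 * 96:
        return 5
    return 3                        # large box: foreground target object
```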
According to an embodiment of the present application, in operation S230, detecting the target object in the image to be detected according to the target prediction depth may include the following operations.
And detecting the target object in the image to be detected according to the target prediction depth to obtain a three-dimensional detection frame representing the target object.
Fig. 6 schematically shows an application scenario diagram of a target object detection method according to an embodiment of the present application.
As shown in fig. 6, in the application scenario 600, the acquired image 610 to be detected may be input into a target object detection model 620, and the image 610 to be detected may include a target object, i.e., a vehicle 611.
The target object detection model 620 may include a semantic feature extraction layer 621, a keypoint depth prediction layer 6221, a confidence prediction layer 6222, a two-dimensional detection box prediction layer 6223, a classification prediction layer 6224, a target prediction depth output layer 623, and a three-dimensional detection box output layer 624.
The semantic feature extraction layer 621 can be constructed based on a residual network (ResNet) algorithm and is used for extracting image semantic features from the image 610 to be detected. The extracted image semantic features can be input into the keypoint depth prediction layer 6221, the confidence prediction layer 6222, the two-dimensional detection frame prediction layer 6223 and the classification prediction layer 6224, respectively. By adding supervision signals to the center point in the target object center point thermodynamic diagram corresponding to the vehicle 611 and to the neighboring key points adjacent to the center point, key point depth prediction and confidence prediction can be performed on 8 key points including the center point. That is, the keypoint depth prediction layer 6221 can output the initial key point depths d₁, d₂, …, d₈ of the 8 key points, and the confidence prediction layer 6222 outputs the depth information confidences u₁, u₂, …, u₈ of the 8 key points.
The initial key point depths d₁, d₂, …, d₈ and the depth information confidences u₁, u₂, …, u₈ can be input to the target prediction depth output layer 623, and the target prediction depth output layer 623 may calculate the target prediction depth corresponding to the target object (vehicle 611) based on the following formula (2):
D = ∑ᵢ dᵢ·uᵢ ; (2)
In formula (2), i ranges over the 8 key points, dᵢ represents the initial key point depth of the i-th key point, uᵢ represents the depth information confidence corresponding to the initial key point depth dᵢ, and D represents the target prediction depth corresponding to the target object.
Accordingly, the two-dimensional detection frame prediction layer 6223 may output two-dimensional attributes characterizing the target object detection frame, for example, size attributes such as a length, a width, and the like of the two-dimensional detection frame, position attributes of the two-dimensional detection frame, and orientation angle attributes and the like. The classification prediction layer 6224 may output the classification attribute of the target object as a vehicle class.
The size attribute and the orientation angle attribute output by the two-dimensional detection frame prediction layer 6223, the classification attribute output by the classification prediction layer 6224, and the target prediction depth are input to the three-dimensional detection frame output layer 624, so that the three-dimensional detection frame 630 representing the target object can be obtained, thereby realizing the detection of the target object in the image 610 to be detected.
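The structure described above can be summarized in the following PyTorch sketch. The layer attributes mirror the reference numerals of fig. 6, while the ResNet-18 trunk, the head widths and the per-channel handling of the K = 8 key points are illustrative assumptions rather than the patented implementation:

```python
# Structural sketch of the detection model of fig. 6: a semantic feature
# extraction layer (621) feeding parallel prediction heads (6221-6224), with
# the target prediction depth computed as in formula (2).
import torch
import torch.nn as nn
import torchvision

class TargetObjectDetectionModel(nn.Module):
    def __init__(self, num_classes: int = 3, num_keypoints: int = 8):
        super().__init__()
        resnet = torchvision.models.resnet18(weights=None)
        # 621: ResNet trunk without pooling/classifier, 512-channel feature maps.
        self.semantic_features = nn.Sequential(*list(resnet.children())[:-2])

        def head(out_channels: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Conv2d(512, 256, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(256, out_channels, kernel_size=1),
            )

        self.keypoint_depth_layer = head(num_keypoints)  # 6221: d_i
        self.confidence_layer = head(num_keypoints)      # 6222: u_i
        self.box2d_layer = head(4)                       # 6223: 2D box attributes
        self.class_layer = head(num_classes)             # 6224: center point heatmaps

    def forward(self, image: torch.Tensor):
        feats = self.semantic_features(image)
        depths = self.keypoint_depth_layer(feats)                  # (B, K, H, W)
        confidences = torch.sigmoid(self.confidence_layer(feats))  # (B, K, H, W)
        # 623: formula (2), here applied per spatial location over K channels.
        target_depth = (depths * confidences).sum(dim=1, keepdim=True)
        boxes2d = self.box2d_layer(feats)
        heatmaps = torch.sigmoid(self.class_layer(feats))
        return target_depth, boxes2d, heatmaps

model = TargetObjectDetectionModel()
outputs = model(torch.randn(1, 3, 384, 1280))  # e.g. one road-scene image
```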
According to the embodiment of the application, compared with detecting the three-dimensional detection frame corresponding to the target object only by identifying the depth of the central point of the target object in the image to be detected, determining a plurality of key points including the central point and generating the target prediction depth according to the initial key point depth and the depth information confidence of each of the plurality of key points allows the target prediction depth to be closer to the real depth of the target object. This can at least partially avoid the technical problem of incorrect detection of the depth information of the target object caused by a position prediction error of the central point, or of lower detection precision caused by a depth prediction error of the central point. The three-dimensional detection frame generated according to the target prediction depth can then more accurately represent three-dimensional attribute information such as the position, size and shape of the target object in the space to be detected, thereby improving the detection precision for the image to be detected.
According to the embodiment of the application, in an application scenario where the image to be detected is acquired by an industrial robot equipped with an image acquisition device, detecting the target object in the image to be detected according to the target prediction depth in operation S230 can include determining the distance between the target object in the image to be detected and an operating part of the industrial robot according to the target prediction depth, so that the industrial robot can accurately control the operating part to perform operations on the target object according to the target prediction depth, improving operation precision.
It should be noted that the target object detection method provided in the embodiment of the present application may also be applied to various other application scenarios, such as automatic driving assistance: for a vehicle equipped with an image capture device, the target object detection method provided in the embodiment may be used to process an image to be detected captured by the image capture device, detect target objects such as traffic signboards and other vehicles in the space around the vehicle, and control the vehicle to run according to the detection result.
Fig. 7 schematically shows a flowchart of a training method of a deep learning model according to an embodiment of the present application.
As shown in FIG. 7, the training method includes operations S710-S730.
In operation S710, a sample image to be detected is input to the initial deep learning model, and a sample initial keypoint depth of a sample keypoint of the sample target object in the sample image and a sample depth information confidence corresponding to the sample initial keypoint depth are output.
In operation S720, a predicted three-dimensional detection frame of the sample target object is determined according to the sample initial keypoint depth and the sample two-dimensional attribute of the sample target object.
In operation S730, the initial deep learning model is trained by using the labeled three-dimensional detection frame, the predicted three-dimensional detection frame and the sample depth information confidence corresponding to the sample target object, so as to obtain a trained deep learning model.
According to the embodiment of the application, the sample image may include a vehicle, but is not limited thereto, and may further include a traffic signboard or other target object.
According to an embodiment of the present application, the initial deep learning model may include an algorithm model with initial parameters, but is not limited thereto, and may also include an algorithm model with pre-adjusted parameters.
It should be noted that the number of sample target objects in the sample image may be 1 or multiple, and the embodiment of the present application does not limit the number of sample target objects in the sample image.
According to an embodiment of the present application, the sample key points of the sample target object may include any type of points on the sample target object in the space to be detected, for example, points on the edge of a two-dimensional detection frame characterizing the sample target object or any point within the two-dimensional detection frame. Accordingly, the number of the sample key points may be one or more, and the embodiment of the present application does not limit the type of the sample key points and/or the number of the key points, and a person skilled in the art may select the sample key points according to actual needs as long as the sample target object can be characterized.
According to the embodiment of the present application, the sample initial key point depth may include a distance between a sample key point and an image acquisition device that acquires a sample image, but is not limited thereto, and may also include a distance between a sample target object detection device such as a vehicle on which the image acquisition device is mounted and the sample key point.
According to the embodiment of the application, the sample depth information confidence can represent the degree to which the corresponding sample initial key point depth contributes to predicting the depth position of the sample target object.
According to the embodiment of the application, detecting the target object in the image to be detected can mean detecting the position of the target object in the space to be detected according to the target prediction depth, or obtaining detection results such as the classification and moving speed of the target object according to the target prediction depth and other attribute information representing the target object.
According to the embodiment of the application, the predicted three-dimensional detection frame is determined by using the sample initial key point depth and the sample two-dimensional attribute of the sample target object, and the target object detection model is obtained by training with the label three-dimensional detection frame, the predicted three-dimensional detection frame and the sample depth information confidence, so that the prediction accuracy of the target object detection model for the initial key point depth and the depth confidence can be improved, further reducing the error between the subsequently obtained target prediction depth and the real depth of the target object.
According to an embodiment of the present application, the above-described target object detection method may be performed based on a trained deep learning model. For example, the image to be detected may be processed based on the trained deep learning model to detect the target object in the image to be detected, and generate a three-dimensional detection frame representing the target object. Or after the image to be detected is processed based on the trained deep learning model, detection information such as depth information between the image acquisition device and the target object is generated.
In any embodiment of the present application, the sample image may be obtained through various public and legal compliance manners, for example, an image obtained by acquiring an image of a sample space corresponding to authorization information after authorization of a relevant organization or a user is obtained, or a deep learning model training method in an embodiment of the present application is performed by an organization or a user having a relevant image acquisition authority and an image analysis authority.
According to an embodiment of the present application, in operation S730, training the initial deep learning model by using the labeled three-dimensional detection frame corresponding to the sample target object, the predicted three-dimensional detection frame and the sample depth information confidence coefficient may include the following operations.
Determining sample overlapping degree information between the label three-dimensional detection frame and the prediction three-dimensional detection frame; inputting the sample overlapping degree information and the sample depth information confidence degree corresponding to the initial key point depth of the sample into a loss function, and outputting a loss value; adjusting parameters of the initial deep learning model according to the loss value until the loss function is converged; and determining the corresponding initial deep learning model as the trained deep learning model under the condition that the loss function is converged.
According to the embodiment of the application, the sample overlap degree information may be a numerical value or parameter capable of representing the degree of overlap between the predicted three-dimensional detection frame and the label three-dimensional detection frame. The sample overlap degree information may be determined, for example, by calculating an Intersection Over Union (IoU) between the predicted three-dimensional detection frame and the label three-dimensional detection frame. The sample overlap degree information may also be obtained by other calculation methods such as Generalized Intersection Over Union (GIoU) or Distance Intersection Over Union (DIoU), as long as it can characterize the degree of overlap between the predicted three-dimensional detection frame and the label three-dimensional detection frame.
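One plausible reading of this step is that the overlap degree serves as the regression target for the sample depth information confidence, sketched below; this interpretation, the L1 form of the loss and the iou_3d helper passed in as a parameter are all assumptions rather than the patent's definitive loss:

```python
# Hedged sketch: supervise the sample depth information confidence with the
# sample overlap degree between the predicted and label 3D detection frames.
import torch.nn.functional as F

def confidence_loss(pred_boxes, label_boxes, confidences, iou_3d):
    # Sample overlap degree information (an IoU, GIoU or DIoU variant).
    overlap = iou_3d(pred_boxes, label_boxes)   # shape: (num_samples,)
    # detach(): the overlap acts as a fixed target for the confidence branch.
    return F.l1_loss(confidences, overlap.detach())
```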
According to an embodiment of the present application, the sample target object has a plurality of sample keypoints.
In operation S710, inputting the sample image to be detected to the initial deep learning model, and outputting the sample initial keypoint depth and the sample depth information confidence may include the following operations.
Inputting a sample image to be detected to an initial semantic feature extraction layer of the initial deep learning model, and outputting sample image semantic features, wherein the initial deep learning model further comprises an initial key point depth prediction layer and an initial confidence prediction layer; and respectively inputting the sample image semantic features into the initial key point depth prediction layer and the initial confidence prediction layer, and outputting sample initial key point depths corresponding to a plurality of sample key points of the sample target object and sample depth information confidences corresponding to the plurality of sample initial key point depths.
According to an embodiment of the present application, the training method of the deep learning model may further include the following operations.
Determining a sample two-dimensional attribute of a sample target object according to the semantic features of the sample image, wherein the sample two-dimensional attribute of the sample target object comprises a sample target object center point thermodynamic diagram; and determining a plurality of sample key points of the sample target object according to the sample central point corresponding to the sample target object in the sample target object central point thermodynamic diagram.
According to an embodiment of the present application, determining a plurality of sample keypoints of the sample target object according to the sample center point corresponding to the sample target object in the sample target object center point thermodynamic diagram may include the following operations.
Screening a sample adjacent key point corresponding to the sample central point from the sample target object central point thermodynamic diagram according to the sample central point and a static screening threshold value; and determining the sample neighboring keypoints and the sample center point as a plurality of sample keypoints of the sample target object.
According to an embodiment of the application, the static screening threshold comprises at least one of:
a static distance screening threshold value and a static key point quantity screening threshold value.
According to an embodiment of the application, the sample two-dimensional properties of the sample target object further comprise at least one of:
sample class attribute, sample orientation angle attribute, sample size attribute.
According to an embodiment of the present application, determining a plurality of sample keypoints of the sample target object according to the sample center point corresponding to the sample target object in the sample target object center point thermodynamic diagram may include the following operations.
Determining a dynamic screening threshold corresponding to the sample target object according to the sample two-dimensional attribute of the sample target object; and determining a plurality of sample key points of the sample target object according to the sample central point and the dynamic screening threshold value.
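The application leaves open how the two-dimensional attributes map to a dynamic threshold; one plausible sketch, with hypothetical per-category scales and a size-proportional rule, is:

def dynamic_screening_threshold(category: str, size_2d: tuple) -> float:
    # Hypothetical per-category scale factors; larger objects tolerate
    # a wider neighborhood of candidate key points.
    category_scale = {"car": 1.0, "pedestrian": 0.5, "traffic_sign": 0.4}
    scale = category_scale.get(category, 1.0)
    width, height = size_2d
    return scale * 0.25 * max(width, height)

# Example: a 64 x 48 pixel car yields a 16-pixel screening radius.
radius = dynamic_screening_threshold("car", (64.0, 48.0))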
Fig. 8 schematically shows a block diagram of a target object detection apparatus according to an embodiment of the present application.
As shown in fig. 8, the target object detecting apparatus 800 includes a prediction module 810, a target prediction depth determining module 820, and a detecting module 830.
The predicting module 810 is configured to perform depth information prediction on a target object in an image to be detected, so as to obtain an initial key point depth of a key point of the target object and a depth information confidence corresponding to the initial key point depth.
The target prediction depth determining module 820 is configured to determine a target prediction depth of the target object according to the depth information confidence and the initial key point depth.
The detecting module 830 is configured to detect a target object in the image to be detected according to the target prediction depth.
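One plausible realization of the target prediction depth determining module 820 (an assumption of this sketch, since the application only states that the target prediction depth is determined from the confidences and the initial key point depths) is a confidence-weighted average over the object's key points:

import torch

def target_prediction_depth(initial_depths: torch.Tensor, confidences: torch.Tensor) -> torch.Tensor:
    # initial_depths: (K,) initial depths of the K key points of one object.
    # confidences:    (K,) depth information confidences in (0, 1).
    weights = confidences / confidences.sum().clamp_min(1e-9)
    return (weights * initial_depths).sum()

# Example: the most confident key point dominates the fused depth.
depth = target_prediction_depth(torch.tensor([12.0, 15.0, 13.0]), torch.tensor([0.9, 0.1, 0.6]))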
According to an embodiment of the present application, the target object includes a plurality of key points.
The prediction module comprises a semantic feature extraction unit and a prediction unit.
The semantic feature extraction unit is used for inputting the image to be detected to a semantic feature extraction layer of the target object detection model and outputting image semantic features, wherein the target object detection model further comprises a key point depth prediction layer and a confidence prediction layer.
The prediction unit is used for respectively inputting the image semantic features into the key point depth prediction layer and the confidence prediction layer and outputting the initial key point depths corresponding to each of the plurality of key points of the target object and the depth information confidences corresponding to each of the plurality of initial key point depths.
According to an embodiment of the present application, the target object detection apparatus further includes a two-dimensional attribute determining module and a key point determining module.
The two-dimensional attribute determining module is used for determining the two-dimensional attribute of the target object according to the image semantic features, wherein the two-dimensional attribute of the target object comprises a target object center point thermodynamic diagram.
The key point determining module is used for determining a plurality of key points of the target object according to the center point corresponding to the target object in the target object center point thermodynamic diagram.
According to an embodiment of the application, the key point determining module comprises an adjacent key point screening unit and a first key point determining unit.
The adjacent key point screening unit is used for screening out the adjacent key points corresponding to the center point from the target object center point thermodynamic diagram according to the center point and a static screening threshold.
The first key point determining unit is used for determining the adjacent key points and the center point as the plurality of key points of the target object.
According to an embodiment of the application, the static screening threshold comprises at least one of:
a static distance screening threshold value and a static key point quantity screening threshold value.
According to an embodiment of the application, the two-dimensional properties of the target object further comprise at least one of:
category attribute, orientation angle attribute, size attribute.
According to an embodiment of the application, the key point determining module comprises a dynamic screening threshold determining unit and a second key point determining unit.
The dynamic screening threshold determining unit is used for determining a dynamic screening threshold corresponding to the target object according to the two-dimensional attribute of the target object.
The second key point determining unit is used for determining a plurality of key points of the target object according to the center point and the dynamic screening threshold.
According to an embodiment of the application, the detection module comprises a detection unit.
The detection unit is used for detecting the target object in the image to be detected according to the target prediction depth to obtain a three-dimensional detection frame representing the target object.
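To illustrate how such a detection unit can turn the target prediction depth into a three-dimensional detection frame, the following sketch back-projects the detected two-dimensional center through a pinhole camera model and assembles the eight box corners from the predicted size and orientation angle; the pinhole assumption and the corner layout follow common monocular three-dimensional detection practice rather than any formula given in the application.

import numpy as np

def detection_frame_from_depth(center_2d, depth, size_3d, yaw, intrinsics):
    # Back-project the image-plane center (u, v) at the predicted target depth.
    u, v = center_2d
    fx, fy = intrinsics[0, 0], intrinsics[1, 1]
    cx, cy = intrinsics[0, 2], intrinsics[1, 2]
    center = np.array([(u - cx) * depth / fx, (v - cy) * depth / fy, depth])
    # Corners of an (l, w, h) box in the object frame, rotated by yaw
    # about the camera's vertical axis, then translated to the center.
    l, w, h = size_3d
    x = np.array([ 1,  1, -1, -1,  1,  1, -1, -1]) * l / 2
    y = np.array([ 1,  1,  1,  1, -1, -1, -1, -1]) * h / 2
    z = np.array([ 1, -1, -1,  1,  1, -1, -1,  1]) * w / 2
    rotation = np.array([[np.cos(yaw), 0, np.sin(yaw)],
                         [0, 1, 0],
                         [-np.sin(yaw), 0, np.cos(yaw)]])
    corners = rotation @ np.vstack([x, y, z]) + center[:, None]
    return center, corners  # center (3,), corners (3, 8)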
Fig. 9 schematically shows a block diagram of a training apparatus for deep learning models according to an embodiment of the present application.
As shown in fig. 9, the training apparatus 900 for deep learning model includes a sample image processing module 910, a predicted three-dimensional detection frame determining module 920, and a training module 930.
The sample image processing module 910 is configured to input a sample image to be detected into the initial deep learning model, and output a sample initial keypoint depth of a sample keypoint of the sample target object in the sample image and a sample depth information confidence corresponding to the sample initial keypoint depth.
The predicted three-dimensional detection frame determining module 920 is configured to determine a predicted three-dimensional detection frame of the sample target object according to the sample initial key point depth and the sample two-dimensional attribute of the sample target object.
The training module 930 is configured to train the initial deep learning model by using the label three-dimensional detection frame corresponding to the sample target object, the predicted three-dimensional detection frame, and the sample depth information confidence, so as to obtain a trained deep learning model.
According to an embodiment of the application, the training module comprises a sample overlapping degree information determining unit, a loss value determining unit, a parameter adjusting unit, and a deep learning model determining unit.
The sample overlapping degree information determining unit is used for determining the sample overlapping degree information between the label three-dimensional detection frame and the predicted three-dimensional detection frame.
The loss value determining unit is used for inputting the sample overlapping degree information and the sample depth information confidence corresponding to the sample initial key point depth into the loss function and outputting a loss value.
The parameter adjusting unit is used for adjusting the parameters of the initial deep learning model according to the loss value until the loss function converges.
The deep learning model determining unit is used for determining the corresponding initial deep learning model as the trained deep learning model under the condition that the loss function converges.
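The application states only that the sample overlap degree information and the sample depth information confidence both enter the loss function; one plausible concrete pairing (an assumption of this sketch, echoing IoU-supervised confidence losses used elsewhere in detection) treats the measured IoU as a soft target for the confidence under a binary cross entropy:

import torch

def depth_confidence_loss(sample_iou: torch.Tensor, confidence: torch.Tensor) -> torch.Tensor:
    # sample_iou: (N,) overlap of each predicted frame with its label frame.
    # confidence: (N,) predicted sample depth information confidences.
    eps = 1e-6
    confidence = confidence.clamp(eps, 1 - eps)
    bce = -(sample_iou * confidence.log() + (1 - sample_iou) * (1 - confidence).log())
    return bce.mean()

# Training nudges each confidence toward the overlap actually achieved.
loss = depth_confidence_loss(torch.tensor([0.8, 0.3]), torch.tensor([0.6, 0.5]))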
According to an embodiment of the present application, the sample target object includes a plurality of sample key points.
The sample image processing module may include a sample semantic feature extraction unit and a sample prediction unit.
The sample semantic feature extraction unit is used for inputting the sample image to be detected to the initial semantic feature extraction layer of the initial deep learning model and outputting the sample image semantic features, wherein the initial deep learning model further comprises an initial key point depth prediction layer and an initial confidence prediction layer.
The sample prediction unit is used for respectively inputting the sample image semantic features into the initial key point depth prediction layer and the initial confidence prediction layer and outputting the sample initial key point depths corresponding to each of the plurality of sample key points of the sample target object and the sample depth information confidences corresponding to each of the plurality of sample initial key point depths.
According to an embodiment of the present application, the training apparatus for deep learning model may further include: a sample two-dimensional attribute determining module and a sample key point determining module.
The sample two-dimensional attribute determining module is used for determining the sample two-dimensional attribute of the sample target object according to the sample image semantic features, wherein the sample two-dimensional attribute of the sample target object comprises a sample target object center point thermodynamic diagram.
The sample key point determining module is used for determining a plurality of sample key points of the sample target object according to the sample center point corresponding to the sample target object in the sample target object center point thermodynamic diagram.
According to an embodiment of the present application, the sample key point determining module may include a sample adjacent key point screening unit and a first sample key point determining unit.
The sample adjacent key point screening unit is used for screening out the sample adjacent key points corresponding to the sample center point from the sample target object center point thermodynamic diagram according to the sample center point and a static screening threshold.
The first sample key point determining unit is used for determining the sample adjacent key points and the sample center point as the plurality of sample key points of the sample target object.
According to an embodiment of the application, the static screening threshold comprises at least one of:
a static distance screening threshold value and a static key point quantity screening threshold value.
According to an embodiment of the application, the sample two-dimensional properties of the sample target object further comprise at least one of:
sample class attribute, sample orientation angle attribute, sample size attribute.
According to an embodiment of the present application, the sample key point determining module may include a dynamic screening threshold determining unit and a second sample key point determining unit.
The dynamic screening threshold determining unit is used for determining a dynamic screening threshold corresponding to the sample target object according to the sample two-dimensional attribute of the sample target object.
The second sample key point determining unit is used for determining a plurality of sample key points of the sample target object according to the sample center point and the dynamic screening threshold.
According to embodiments of the present application, an electronic device, a readable storage medium, and a computer program product are also provided.
According to an embodiment of the present application, an electronic device includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described above.
According to an embodiment of the present application, a non-transitory computer-readable storage medium has stored thereon computer instructions for causing a computer to perform the method as described above.
According to an embodiment of the application, a computer program product comprises a computer program which, when executed by a processor, implements the method as described above.
FIG. 10 shows a schematic block diagram of an example electronic device 1000 that may be used to implement embodiments of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 10, the device 1000 includes a computing unit 1001 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1002 or a computer program loaded from a storage unit 1008 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data necessary for the operation of the device 1000 can also be stored. The computing unit 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.
A number of components in device 1000 are connected to I/O interface 1005, including: an input unit 1006 such as a keyboard, a mouse, and the like; an output unit 1007 such as various types of displays, speakers, and the like; a storage unit 1008 such as a magnetic disk, an optical disk, or the like; and a communication unit 1009 such as a network card, a modem, a wireless communication transceiver, or the like. The communication unit 1009 allows the device 1000 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 1001 may be a variety of general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 1001 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 1001 executes the respective methods and processes described above, such as a target object detection method or a training method of a deep learning model. For example, in some embodiments, the target object detection method or the training method of the deep learning model may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 1008. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1000 via the ROM 1002 and/or the communication unit 1009. When the computer program is loaded into the RAM 1003 and executed by the computing unit 1001, one or more steps of the target object detection method or the training method of the deep learning model described above may be performed. Alternatively, in other embodiments, the computing unit 1001 may be configured in any other suitable way (e.g., by means of firmware) to perform the target object detection method or the training method of the deep learning model.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present application may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/acts specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this application, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in a different order, and are not limited herein as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (22)

1. A target object detection method, comprising:
carrying out depth information prediction on a target object in an image to be detected to obtain initial key point depth of key points of the target object and a depth information confidence coefficient corresponding to the initial key point depth;
determining the target prediction depth of the target object according to the depth information confidence coefficient and the initial key point depth; and
and detecting the target object in the image to be detected according to the target prediction depth.
2. The method of claim 1, wherein the target object's keypoints comprise a plurality;
the depth information prediction of the target object in the image to be detected comprises the following steps:
inputting the image to be detected to a semantic feature extraction layer of a target object detection model, and outputting image semantic features, wherein the target object detection model further comprises a key point depth prediction layer and a confidence coefficient prediction layer;
and respectively inputting the image semantic features into the key point depth prediction layer and the confidence coefficient prediction layer, and outputting initial key point depths corresponding to a plurality of key points of the target object and depth information confidence coefficients corresponding to the initial key point depths.
3. The method of claim 2, further comprising:
determining two-dimensional attributes of the target object according to the image semantic features, wherein the two-dimensional attributes of the target object comprise a target object center point thermodynamic diagram; and
and determining a plurality of key points of the target object according to a central point corresponding to the target object in the target object central point thermodynamic diagram.
4. The method of claim 3, wherein determining the plurality of keypoints for the target object from a center point in the target object center point thermodynamic diagram corresponding to the target object comprises:
according to the central point and a static screening threshold value, screening out adjacent key points corresponding to the central point from the target object central point thermodynamic diagram; and
determining the adjacent key points and the central point as the plurality of key points of the target object.
5. The method of claim 4, wherein the static screening threshold comprises at least one of:
a static distance screening threshold value and a static key point quantity screening threshold value.
6. The method of claim 3, wherein the two-dimensional properties of the target object further comprise at least one of:
category attribute, orientation angle attribute, size attribute.
7. The method of claim 6, wherein determining the plurality of keypoints for the target object from a center point in the target object center point thermodynamic diagram corresponding to the target object comprises:
determining a dynamic screening threshold corresponding to the target object according to the two-dimensional attribute of the target object; and
and determining a plurality of key points of the target object according to the central point and the dynamic screening threshold value.
8. The method of claim 1, wherein detecting a target object in the image to be detected according to the target prediction depth comprises:
and detecting the target object in the image to be detected according to the target prediction depth to obtain a three-dimensional detection frame representing the target object.
9. A training method of a deep learning model, comprising:
inputting a sample image to be detected into an initial deep learning model, and outputting sample initial key point depth of sample key points of a sample target object in the sample image and a sample depth information confidence coefficient corresponding to the sample initial key point depth;
determining a predicted three-dimensional detection frame of the sample target object according to the sample initial key point depth and the sample two-dimensional attribute of the sample target object;
and training the initial deep learning model by utilizing the label three-dimensional detection frame corresponding to the sample target object, the prediction three-dimensional detection frame and the sample depth information confidence coefficient to obtain the trained deep learning model.
10. The training method of claim 9, wherein training the initial deep learning model with the labeled three-dimensional detection box corresponding to the sample target object, the predicted three-dimensional detection box, and the sample depth information confidence comprises:
determining sample overlapping degree information between the label three-dimensional detection frame and the prediction three-dimensional detection frame;
inputting the sample overlapping degree information and a sample depth information confidence coefficient corresponding to the initial key point depth of the sample into a loss function, and outputting a loss value;
adjusting parameters of the initial deep learning model according to the loss value until the loss function converges; and
and determining the corresponding initial deep learning model as the trained deep learning model under the condition that the loss function is converged.
11. A target object detection apparatus comprising:
the prediction module is used for predicting depth information of a target object in an image to be detected to obtain initial key point depth of key points of the target object and a depth information confidence coefficient corresponding to the initial key point depth;
a target prediction depth determination module, configured to determine a target prediction depth of the target object according to the depth information confidence coefficient and the initial key point depth; and
and the detection module is used for detecting the target object in the image to be detected according to the target prediction depth.
12. The apparatus of claim 11, wherein the target object's keypoints comprise a plurality;
wherein the prediction module comprises:
the semantic feature extraction unit is used for inputting the image to be detected to a semantic feature extraction layer of a target object detection model and outputting image semantic features, wherein the target object detection model further comprises a key point depth prediction layer and a confidence coefficient prediction layer;
and the prediction unit is used for inputting the semantic features of the image into the key point depth prediction layer and the confidence coefficient prediction layer respectively and outputting initial key point depths corresponding to a plurality of key points of the target object and depth information confidence coefficients corresponding to the initial key point depths.
13. The apparatus of claim 12, further comprising:
the two-dimensional attribute determining module is used for determining the two-dimensional attribute of the target object according to the image semantic features, wherein the two-dimensional attribute of the target object comprises a target object center point thermodynamic diagram; and
and the key point determining module is used for determining a plurality of key points of the target object according to the central point corresponding to the target object in the target object central point thermodynamic diagram.
14. The apparatus of claim 13, wherein the keypoint determination module comprises:
an adjacent key point screening unit for screening out the adjacent key points corresponding to the central point from the target object central point thermodynamic diagram according to the central point and a static screening threshold value; and
a first key point determination unit for determining the adjacent key points and the central point as the plurality of key points of the target object.
15. The apparatus of claim 14, wherein the static screening threshold comprises at least one of:
a static distance screening threshold value and a static key point quantity screening threshold value.
16. The apparatus of claim 13, wherein the two-dimensional attributes of the target object further comprise at least one of:
category attribute, orientation angle attribute, size attribute.
17. The apparatus of claim 16, wherein the keypoint determination module comprises:
the dynamic screening threshold value determining unit is used for determining a dynamic screening threshold value corresponding to the target object according to the two-dimensional attribute of the target object; and
and the second key point determining unit is used for determining a plurality of key points of the target object according to the central point and the dynamic screening threshold.
18. The apparatus of claim 11, wherein the detection module comprises:
and the detection unit is used for detecting the target object in the image to be detected according to the target prediction depth to obtain a three-dimensional detection frame representing the target object.
19. A training apparatus for deep learning models, comprising:
the sample image processing module is used for inputting a sample image to be detected into the initial deep learning model and outputting the sample initial key point depth of the sample key points of the sample target object in the sample image and the sample depth information confidence coefficient corresponding to the sample initial key point depth;
the predicted three-dimensional detection frame determining module is used for determining a predicted three-dimensional detection frame of the sample target object according to the sample initial key point depth and the sample two-dimensional attribute of the sample target object;
and the training module is used for training the initial deep learning model by utilizing the label three-dimensional detection frame corresponding to the sample target object, the prediction three-dimensional detection frame and the sample depth information confidence coefficient to obtain a trained deep learning model.
20. The training apparatus of claim 19, wherein the training module comprises:
a sample overlapping degree information determining unit for determining sample overlapping degree information between the label three-dimensional detection frame and the prediction three-dimensional detection frame;
a loss value determining unit, configured to input the sample overlap degree information and the sample depth information confidence coefficient corresponding to the sample initial key point depth to a loss function, and output a loss value;
the parameter adjusting unit is used for adjusting the parameters of the initial deep learning model according to the loss value until the loss function is converged; and
and the deep learning model determining unit is used for determining the corresponding initial deep learning model as the trained deep learning model under the condition that the loss function is converged.
21. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 10.
22. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1 to 10.
CN202310113169.7A 2023-02-15 2023-02-15 Target object detection method, training device and electronic equipment Active CN115861400B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310113169.7A CN115861400B (en) 2023-02-15 2023-02-15 Target object detection method, training device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310113169.7A CN115861400B (en) 2023-02-15 2023-02-15 Target object detection method, training device and electronic equipment

Publications (2)

Publication Number Publication Date
CN115861400A true CN115861400A (en) 2023-03-28
CN115861400B CN115861400B (en) 2023-05-12

Family

ID=85658072

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310113169.7A Active CN115861400B (en) 2023-02-15 2023-02-15 Target object detection method, training device and electronic equipment

Country Status (1)

Country Link
CN (1) CN115861400B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111179419A (en) * 2019-12-31 2020-05-19 北京奇艺世纪科技有限公司 Three-dimensional key point prediction and deep learning model training method, device and equipment
CN113159200A (en) * 2021-04-27 2021-07-23 苏州科达科技股份有限公司 Object analysis method, device and storage medium
CN114037087A (en) * 2021-10-29 2022-02-11 北京百度网讯科技有限公司 Model training method and device, depth prediction method and device, equipment and medium
CN114283265A (en) * 2021-12-03 2022-04-05 北京航空航天大学 Unsupervised face correcting method based on 3D rotation modeling
CN114581730A (en) * 2022-03-03 2022-06-03 北京百度网讯科技有限公司 Training method of detection model, target detection method, device, equipment and medium
CN114677653A (en) * 2022-03-04 2022-06-28 北京百度网讯科技有限公司 Model training method, vehicle key point detection method and corresponding devices
CN115294172A (en) * 2022-07-27 2022-11-04 北京百度网讯科技有限公司 Target detection method and device, electronic equipment and storage medium
CN115311570A (en) * 2022-08-17 2022-11-08 东南大学 Infrared dim target detection method based on key point representation
CN115578431A (en) * 2022-10-17 2023-01-06 北京百度网讯科技有限公司 Image depth processing method and device, electronic equipment and medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116597213A (en) * 2023-05-18 2023-08-15 北京百度网讯科技有限公司 Target detection method, training device, electronic equipment and storage medium
CN116580210A (en) * 2023-07-05 2023-08-11 四川弘和数智集团有限公司 Linear target detection method, device, equipment and medium
CN116580210B (en) * 2023-07-05 2023-09-15 四川弘和数智集团有限公司 Linear target detection method, device, equipment and medium

Also Published As

Publication number Publication date
CN115861400B (en) 2023-05-12

Similar Documents

Publication Publication Date Title
CN108229353B (en) Human body image classification method and apparatus, electronic device, storage medium, and program
CN115861400B (en) Target object detection method, training device and electronic equipment
CN108229418B (en) Human body key point detection method and apparatus, electronic device, storage medium, and program
EP3852008A2 (en) Image detection method and apparatus, device, storage medium and computer program product
CN113065614B (en) Training method of classification model and method for classifying target object
CN115880536B (en) Data processing method, training method, target object detection method and device
CN113674421B (en) 3D target detection method, model training method, related device and electronic equipment
CN113971751A (en) Training feature extraction model, and method and device for detecting similar images
CN114612743A (en) Deep learning model training method, target object identification method and device
CN113780098A (en) Character recognition method, character recognition device, electronic equipment and storage medium
CN113947188A (en) Training method of target detection network and vehicle detection method
CN113205041A (en) Structured information extraction method, device, equipment and storage medium
CN113239807B (en) Method and device for training bill identification model and bill identification
CN110895811A (en) Image tampering detection method and device
CN113902899A (en) Training method, target detection method, device, electronic device and storage medium
CN113255501A (en) Method, apparatus, medium, and program product for generating form recognition model
CN111160410A (en) Object detection method and device
CN116109874A (en) Detection method, detection device, electronic equipment and storage medium
CN115719436A (en) Model training method, target detection method, device, equipment and storage medium
CN114596188A (en) Watermark detection method, model training method, device and electronic equipment
CN112749978B (en) Detection method, apparatus, device, storage medium, and program product
CN111968030B (en) Information generation method, apparatus, electronic device and computer readable medium
CN113344064A (en) Event processing method and device
CN111428729A (en) Target detection method and device
CN114092739B (en) Image processing method, apparatus, device, storage medium, and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant