CN116343153A - Target detection method and device, electronic equipment and storage medium

Target detection method and device, electronic equipment and storage medium

Info

Publication number
CN116343153A
Authority
CN
China
Prior art keywords
target
pose information
target object
scene image
initial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310318803.0A
Other languages
Chinese (zh)
Inventor
姚卓坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Qianshi Technology Co Ltd
Original Assignee
Beijing Jingdong Qianshi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Qianshi Technology Co Ltd
Priority to CN202310318803.0A
Publication of CN116343153A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/60 Analysis of geometric attributes
    • G06T7/66 Analysis of geometric attributes of image moments or centre of gravity
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/64 Three-dimensional objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/70 Labelling scene content, e.g. deriving syntactic or semantic representations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides a target detection method and device, electronic equipment, and a storage medium, which can be applied to the technical fields of intelligent driving, automatic driving, environment sensing, and the like. The target detection method comprises the following steps: extracting initial pose information of a target object from a scene image to be processed; adding position labeling information to the target object in the scene image to be processed based on the initial pose information to obtain a labeling scene image; inputting the labeling scene image into a correction network, and outputting a pose deviation represented by the position labeling information associated with the initial pose information in the labeling scene image; and correcting the initial pose information based on the pose deviation to obtain target pose information of the target object.

Description

Target detection method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the technical fields of intelligent driving, automatic driving, environment sensing, and the like, and in particular to a target detection method, apparatus, device, medium, and program product.
Background
In the field of automatic driving, 3D object detection from images is an essential task: features of the input image signal are mined to obtain pose information of target objects, thereby realizing perception of the 3D information of the scene. In the process of implementing the disclosed concept, the inventor found that the related art has at least the following problem: existing 3D target detection technology has low precision and cannot accurately estimate the pose of a target object.
Disclosure of Invention
In view of the foregoing, the present disclosure provides a target detection method, apparatus, device, medium, and program product.
In one aspect of the present disclosure, there is provided a target detection method including:
extracting initial pose information of a target object from a scene image to be processed, wherein the scene image to be processed is used for representing the surrounding driving environment of a target vehicle in the driving process, and the surrounding driving environment of the target vehicle comprises at least one object;
adding position labeling information to a target object in a scene image to be processed based on the initial pose information to obtain a labeling scene image;
inputting the annotation scene image into a correction network, and outputting pose deviation represented by position annotation information associated with initial pose information in the annotation scene image;
and correcting the initial pose information based on the pose deviation to obtain the target pose information of the target object.
According to an embodiment of the present disclosure, adding position labeling information to the target object in the scene image to be processed based on the initial pose information to obtain the labeled scene image includes:
drawing an external frame of the target object in the scene image to be processed based on the initial pose information to obtain the labeled scene image.
According to an embodiment of the present disclosure, drawing a circumscribed frame of the target object in the scene image to be processed based on the initial pose information to obtain the labeled scene image includes:
calculating to obtain first position coordinates of a plurality of corner points of an external frame of a target object in a camera coordinate system based on initial pose information;
converting the first position coordinates of the plurality of corner points to obtain second position coordinates of the plurality of corner points of the external frame of the target object in the image coordinate system;
and drawing an external frame of the target object in the scene image to be processed based on the second position coordinates of the plurality of corner points, so as to obtain the annotation scene image.
According to an embodiment of the present disclosure, wherein:
the initial pose information includes: initial position coordinates of a center point of the target object, an initial three-dimensional size of the target object, and an initial direction angle of the target object, wherein the initial direction angle is used for representing: an angle between the first travel direction of the target object and the second travel direction of the target vehicle.
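For illustration only (not part of the claimed subject matter), the pose information described above maps naturally onto a simple data structure. The Python field names below are assumptions chosen for readability; the sketch is given purely to make the three groups of quantities concrete.

```python
from dataclasses import dataclass

@dataclass
class Pose3D:
    # Position coordinates of the center point of the target object
    # in the camera coordinate system (e.g., meters).
    x: float
    y: float
    z: float
    # Three-dimensional size of the target object's circumscribed frame.
    length: float
    width: float
    height: float
    # Direction angle: angle between the travel direction of the target
    # object and the travel direction of the target vehicle (radians).
    yaw: float
```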
According to an embodiment of the present disclosure, wherein:
the pose deviation comprises: position deviation of a center point of the target object, three-dimensional size deviation of the target object, and direction angle deviation of the target object;
the target pose information includes: target position coordinates of the center point of the target object, a target three-dimensional size of the target object, and a target direction angle of the target object; and
correcting the initial pose information based on the pose deviation to obtain the target pose information of the target object comprises:
calculating according to the initial position coordinates and the position deviation to obtain target position coordinates;
calculating according to the initial three-dimensional size and the three-dimensional size deviation to obtain a target three-dimensional size;
and calculating according to the initial direction angle and the direction angle deviation to obtain the target direction angle.
According to an embodiment of the present disclosure, extracting initial pose information of a target object from a scene image to be processed includes:
inputting the scene image to be processed into a target detection network, and outputting initial pose information of a target object.
According to an embodiment of the present disclosure, further comprising, after obtaining the target pose information of the target object:
and carrying out repeated iterative correction on the target pose information until a preset termination condition is reached, so as to obtain final-stage pose information.
According to an embodiment of the present disclosure, performing iterative correction on the target pose information a plurality of times until a preset termination condition is reached, so as to obtain the finally determined final-stage pose information, includes:
Performing ith iteration correction on the target pose information to obtain ith iteration pose information;
determining the i-th iteration pose deviation corresponding to the i-th iteration pose information; and
iteratively performing: carrying out the (i+1)-th iteration correction on the i-th iteration pose information based on the i-th iteration pose deviation to obtain the (i+1)-th iteration pose information, and determining the (i+1)-th iteration pose deviation corresponding to the (i+1)-th iteration pose information, until the (i+1)-th iteration pose deviation is smaller than a preset deviation threshold value, so as to obtain the final-stage pose information.
Another aspect of the present disclosure provides an object detection apparatus, including:
the extraction module is used for extracting initial pose information of the target object from a scene image to be processed, wherein the scene image to be processed is used for representing the surrounding driving environment of the target vehicle in the driving process, and the surrounding driving environment of the target vehicle comprises at least one object;
the labeling module is used for adding position labeling information to a target object in the scene image to be processed based on the initial pose information to obtain a labeled scene image;
the correction module is used for inputting the annotation scene image into the correction network and outputting pose deviation represented by the position annotation information associated with the initial pose information in the annotation scene image;
And the correction module is used for correcting the initial pose information based on the pose deviation to obtain the target pose information of the target object.
Another aspect of the present disclosure provides an electronic device, comprising: one or more processors; and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the target detection method described above.
Another aspect of the present disclosure also provides a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform the above-described object detection method.
Another aspect of the present disclosure also provides a computer program product comprising a computer program which, when executed by a processor, implements the above-described object detection method.
The target detection method according to the embodiments of the present disclosure adopts a two-stage processing concept to detect the pose information of an object. In the first stage, features are mined from the image by a vision-based 3D target detection technique, and initial three-dimensional pose information of the target object is output. In the second stage, the initial three-dimensional pose information is further corrected. The scene image to be processed is a two-dimensional image, and because a two-dimensional image lacks the three-dimensional geometric information of the scene (especially depth information), directly estimating the three-dimensional information of an object with a visual detection technique in the first stage cannot accurately estimate the pose of the target object, so the resulting detection result is not accurate enough. For this reason, the embodiments of the present disclosure further correct the preliminarily extracted, low-accuracy three-dimensional pose information of the object to obtain a more accurate detection result. Through this two-stage image processing strategy, the low-quality detection result of the first stage is re-projected into the two-dimensional image and labeled with visual information; the labeled image is then corrected by the correction model in the second stage, so that the quality of the earlier result is automatically evaluated and corrected. The corrected and fine-tuned pose information is more accurate, the accuracy of object information detection is improved, and the unmanned vehicle achieves higher driving safety when realizing automatic driving based on the more accurate object information.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be more apparent from the following description of embodiments of the disclosure with reference to the accompanying drawings, in which:
FIG. 1 schematically illustrates an application scenario diagram of a target detection method, apparatus, device, medium and program product according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow chart of a target detection method according to an embodiment of the disclosure;
FIG. 3 illustrates a process schematic diagram of a target detection method according to an embodiment of the present disclosure;
FIG. 4 schematically illustrates a flow diagram of processing a to-be-processed scene image to obtain an annotated scene image, according to an embodiment of the disclosure;
FIG. 5 schematically illustrates a block diagram of an object detection apparatus according to an embodiment of the present disclosure;
fig. 6 schematically illustrates a block diagram of an electronic device adapted to implement a target detection method according to an embodiment of the disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is only exemplary and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the present disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
Where an expression like "at least one of A, B and C, etc." is used, it should generally be interpreted in accordance with the meaning commonly understood by those skilled in the art (e.g., "a system having at least one of A, B and C" shall include, but not be limited to, a system having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.).
In the embodiments of the present disclosure, the collection, updating, analysis, processing, use, transmission, provision, disclosure, storage, etc., of the data involved (including, but not limited to, user personal information) all comply with the relevant laws and regulations, are used for legitimate purposes, and do not violate public order and good morals. In particular, necessary measures are taken for the personal information of users, illegal access to users' personal information data is prevented, and the personal information security of users, network security, and national security are maintained.
In embodiments of the present disclosure, the user's authorization or consent is obtained before the user's personal information is obtained or collected.
The embodiment of the disclosure provides a target detection method, which comprises the following steps:
extracting initial pose information of a target object from a scene image to be processed, wherein the scene image to be processed is used for representing the surrounding driving environment of a target vehicle in the driving process, and the surrounding driving environment of the target vehicle comprises at least one object; adding position labeling information to a target object in a scene image to be processed based on the initial pose information to obtain a labeling scene image; inputting the annotation scene image into a correction network, and outputting pose deviation represented by position annotation information associated with initial pose information in the annotation scene image; and correcting the initial pose information based on the pose deviation to obtain the target pose information of the target object.
Fig. 1 schematically illustrates an application scenario diagram of a target detection method, apparatus, device, medium and program product according to an embodiment of the present disclosure.
As shown in fig. 1, an application scenario 100 according to this embodiment may include a vehicle 101, a server 102. The communication between the vehicle 101 and the server 102 may be via a network, which may include various connection types, such as wired, wireless communication links, or fiber optic cables, etc.
An autopilot module is mounted in the vehicle 101 for guiding the vehicle to achieve automatic driving. An image acquisition device, such as a camera, is installed in the vehicle 101 for acquiring images of the surrounding environment during the running of the vehicle 101, and a laser radar may also be installed for acquiring point cloud data of the surrounding environment of the vehicle 101.
The server 102 may be a server providing various services, for example, may be a background management server, and may perform processing such as analysis on received data such as a user request, and may feed back processing results (such as a web page, information, or data acquired or generated according to the user request) to the terminal device.
In the application scenario of the embodiments of the present disclosure, during the driving of the vehicle 101, the image acquisition device installed therein acquires a scene image of the surrounding environment and sends it to the server 102. After receiving the scene image, the server 102 may process it in response to an image data processing request, for example by executing the target detection method of the embodiments of the present disclosure: extracting initial pose information of the target object from the scene image to be processed; adding position labeling information to the target object in the scene image to be processed based on the initial pose information to obtain the labeled scene image; inputting the labeled scene image into the correction network and outputting the pose deviation represented by the position labeling information associated with the initial pose information in the labeled scene image; and correcting the initial pose information based on the pose deviation to obtain the target pose information of the target object. The target pose information is then returned to the automatic driving module in the vehicle 101, so that the automatic driving module assists the vehicle 101 in realizing automatic driving according to the pose information of the objects identified in the surrounding environment.
The object detection method of the disclosed embodiment will be described in detail below with reference to fig. 2 to 6 based on the scenario described in fig. 1.
Fig. 2 schematically illustrates a flow chart of a target detection method according to an embodiment of the disclosure.
As shown in fig. 2, the target detection method of this embodiment includes operations S201 to S204.
In operation S201, initial pose information of a target object is extracted from a to-be-processed scene image, wherein the to-be-processed scene image is used to characterize a surrounding driving environment of a target vehicle during driving, and the surrounding driving environment of the target vehicle includes at least one object;
in operation S202, adding position labeling information to a target object in a scene image to be processed based on initial pose information to obtain a labeled scene image;
in operation S203, the annotation scene image is input into the correction network, and the pose deviation represented by the position annotation information associated with the initial pose information in the annotation scene image is output;
in operation S204, the initial pose information is corrected based on the pose deviation, and the target pose information of the target object is obtained.
According to the embodiment of the disclosure, the target vehicle may be an unmanned vehicle of current interest, and in the driving process of the unmanned vehicle, pose information of an object (for example, an obstacle) in a surrounding environment needs to be obtained, and automatic driving of the unmanned vehicle is realized by combining the pose information of the object with other perception information (for example, traffic light state, road indication information and the like).
An image acquisition device, such as a camera, may be installed in the target vehicle for acquiring an image of the surrounding driving environment during the running of the vehicle. The method disclosed by the embodiment of the disclosure is used for detecting the object in the driving environment image so as to obtain the pose information of the object. It should be noted that, the above method according to the embodiments of the present disclosure is a method for processing based on a single frame scene image. That is, the scene image to be processed is a single frame scene image corresponding to a certain moment acquired by the image acquisition device.
According to the embodiment of the disclosure, the target detection method adopts the concept of two-stage processing for detecting pose information of an obstacle.
Specifically, in the first stage, by performing the above operation S201, preliminary three-dimensional pose information extraction is performed on the two-dimensional image, and a detection result with low accuracy is obtained. In the above operation S201, the initial pose information of the target object is extracted from the to-be-processed scene image, for example, the initial pose information of the target object may be output by using the target detection network. For example, the image detection algorithm may be used to process the scene image to be processed to obtain initial pose information (such as three-dimensional information of object position, size, direction inclination angle, etc.) of the target object.
Further, in the second stage, by executing the operations S202 to S204, the three-dimensional pose information of the object which is preliminarily extracted and has low accuracy is further processed, so as to obtain a more accurate detection result.
Specifically, first, through the above operation S202, position labeling information is added to the target object in the to-be-processed scene image to obtain the labeling scene image; for example, based on the low-accuracy three-dimensional pose result obtained in the first stage, a position labeling frame is added to the target object in the to-be-processed scene image to obtain the labeling scene image.
Next, by performing the above operation S203, pose deviations represented by the position labeling information associated with the initial pose information in the labeling scene image are output through the correction network, such as pose deviations of the object in multiple dimensions including, but not limited to, position deviations, size deviations, direction inclination deviations, and the like.
Then, the above operation S204 may be executed to correct the initial pose information based on the pose deviation, so as to obtain more accurate pose information of the object. After the target pose information of the target object is obtained, the target pose information can be returned to an automatic driving module in the target vehicle, so that the automatic driving module in the vehicle can realize automatic driving of the unmanned vehicle by combining the pose information of the object with other perception information (such as traffic light state, road indication information and the like).
According to the embodiments of the present disclosure, the scene image to be processed is a two-dimensional image. Because a two-dimensional image lacks the three-dimensional geometric information of the scene (especially depth information), in the first stage (operation S201 described above), features of the image are mined by a vision-based 3D target detection technique and three-dimensional pose information of the target object is output, so as to realize perception of the 3D information of the scene. Directly estimating the three-dimensional information of an object with a visual detection technique cannot accurately estimate the pose of the target object, so the resulting detection result is not accurate enough. For this reason, the embodiments of the present disclosure further correct the preliminarily extracted, low-accuracy three-dimensional pose information through operations S202 to S204 to obtain a more accurate detection result. Through this two-stage image processing strategy, the low-quality detection result of the first stage is re-projected into the two-dimensional image and labeled with visual information; the labeled image is then corrected by the correction model in the second stage, so that the quality of the earlier result is automatically evaluated and corrected. The corrected and fine-tuned pose information is more accurate, the accuracy of object information detection is improved, and the unmanned vehicle achieves higher driving safety when realizing automatic driving based on the more accurate object information.
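The data flow of this two-stage strategy can be sketched in Python as follows. This is a hedged illustration, not the claimed implementation: `detection_network`, `draw_annotation`, and `correction_network` are hypothetical callables standing in for the components described above, and each pose is assumed to be a flat 7-vector [x, y, z, length, width, height, yaw].

```python
import numpy as np

def detect_and_refine(scene_image, detection_network, draw_annotation, correction_network):
    """Two-stage detection: coarse 3D poses from a 2D image, then per-object correction."""
    # Stage 1: extract initial (possibly low-quality) pose information from the image.
    initial_poses = detection_network(scene_image)

    refined_poses = []
    for pose in initial_poses:
        # Stage 2a: re-project the coarse pose into the 2D image and draw the
        # position labeling information to obtain the annotated scene image.
        annotated_image = draw_annotation(scene_image, pose)
        # Stage 2b: the correction network regresses the pose deviation implied
        # by how well the drawn frame wraps the object.
        deviation = correction_network(annotated_image)
        # Stage 2c: correct the initial pose with the estimated deviation.
        refined_poses.append(np.asarray(pose, dtype=float) + np.asarray(deviation, dtype=float))
    return refined_poses
```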
According to an embodiment of the present disclosure, specifically, in the first stage, extracting the initial pose information of the target object from the scene image to be processed includes: inputting the scene image to be processed into a target detection network and outputting the initial pose information of the target object. The target detection network can be a YOLO network. A YOLO network predicts the type and position of objects directly through a convolutional neural network; its processing speed is high, which meets the real-time requirement and the information-delay requirement of automatic driving scenarios; it rarely mistakes the image background for an object; and its generalization capability is strong.
According to an embodiment of the present disclosure, pose information of an object may specifically include: position coordinates of a center point of the target object, three-dimensional dimensions of the target object (such as length, width, height of an external frame of the object), and a direction angle of the target object, wherein the direction angle is used for representing: an angle between the first travel direction of the target object and the second travel direction of the target vehicle.
Based on this, in the first stage, by performing preliminary three-dimensional pose information feature extraction on the two-dimensional image, the obtained initial pose information of the target object includes: initial position coordinates of a center point of the target object, initial three-dimensional dimensions of the target object, and initial direction angles of the target object. In the second stage, the target pose information obtained by further correcting the three-dimensional pose information of the object which is preliminarily extracted and has low accuracy comprises the following steps: the target position coordinates of the center point of the target object, the target three-dimensional size of the target object, and the target direction angle of the target object.
The pose deviation represented by the position annotation information associated with the initial pose information in the annotation scene image, output via the correction network, may include: the position deviation of the center point of the target object, the three-dimensional size deviation of the target object, and the direction angle deviation of the target object.
Further, based on the multi-dimensional information type, correcting the initial pose information based on the pose deviation, and obtaining the target pose information of the target object includes:
calculating according to the initial position coordinates and the position deviation to obtain target position coordinates; calculating according to the initial three-dimensional size and the three-dimensional size deviation to obtain a target three-dimensional size; and calculating according to the initial direction angle and the direction angle deviation to obtain the target direction angle.
Fig. 3 illustrates a process schematic diagram of an object detection method according to an embodiment of the present disclosure. The above-described method of the embodiments of the present disclosure is exemplarily described below with reference to fig. 3.
As shown in fig. 3, in the first stage, a scene image to be processed is input into a target detection network, and initial pose information of a target object (initial position coordinates of a center point of the target object, initial three-dimensional size of the target object, initial direction angle of the target object) is output. The target object may be a plurality of obstacles of a plurality of predetermined types, such as pedestrians, motor vehicles, tricycles, bicycles, etc. The initial pose information includes, for example: initial position coordinates (L1, location 1) of the center point of the object 1, initial three-dimensional dimensions (D1, dimension1, for example, length, width, height of the circumscribed frame of the object 1), initial direction angles (Y1, yaw 1) of the target object; initial pose information (L2, D2, Y2) of the object 2; the initial pose information (L3, D3, Y3) … … of the object 3. The three sets of pose parameters may fully describe pose information of one rigid target object in the 3D scene.
As shown in FIG. 3, since the 2D input image lacks depth information of the scene, the pose of the target object output by the target detection network is an inaccurate, low-quality detection result. In the second stage, the low-quality detection result is first re-projected into the 2D image based on the initial pose information and visualized; that is, position labeling information associated with the initial pose information is added to the target object in the to-be-processed scene image to obtain the labeled scene image. As shown in FIG. 3, for example, the labeled scene image may be obtained by adding a position labeling frame to the target object in the to-be-processed scene image.
Then, the pose deviation represented by the position annotation information associated with the initial pose information in the annotation scene image is output through the correction network. As shown in FIG. 3, the output results include, for example, the position deviation (ΔL1) of the center point of object 1, the three-dimensional size deviation (ΔD1) of object 1, and the direction angle deviation (ΔY1) of object 1; the pose deviation (ΔL2, ΔD2, ΔY2) of object 2; the pose deviation (ΔL3, ΔD3, ΔY3) of object 3; and so on.
In the annotation scene image, if the output result of the target detection network is inaccurate, the annotation frame cannot perfectly wrap the object. The correction network can therefore perceive the quality of the first-stage output result and estimate the deviation of the 3D pose from the degree of misalignment of the frame lines.
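Purely as a hedged sketch of what such a correction network could look like (the disclosure does not fix a specific architecture), a small convolutional regressor can consume the annotated image and output a 7-dimensional deviation: 3 values for the center position, 3 for the size, and 1 for the direction angle. The layer sizes below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CorrectionNet(nn.Module):
    """Regresses the pose deviation implied by the drawn frame in the annotated image."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # 7 outputs: position deviation (x, y, z), size deviation (l, w, h),
        # and direction angle deviation.
        self.head = nn.Linear(64, 7)

    def forward(self, annotated_image: torch.Tensor) -> torch.Tensor:
        features = self.backbone(annotated_image).flatten(1)
        return self.head(features)

# Usage sketch: a batch of annotated scene images -> one pose deviation per image.
deviation = CorrectionNet()(torch.randn(1, 3, 224, 224))  # shape (1, 7)
```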
Then, the initial pose information is corrected based on the pose deviation to obtain target pose information of the target object, for example, the initial pose information and the pose deviation are added to obtain target pose information. Therefore, the automatic sensing and fine adjustment correction of the quality of the detection result are realized, and a more accurate estimation result is obtained.
As shown in FIG. 3, the target position coordinates may be calculated from the initial position coordinates and the position deviation: Ln' = Ln + ΔLn; the target three-dimensional size may be calculated from the initial three-dimensional size and the three-dimensional size deviation: Dn' = Dn + ΔDn; and the target direction angle may be calculated from the initial direction angle and the direction angle deviation: Yn' = Yn + ΔYn (n = 1, 2, 3, …).
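A minimal numeric illustration of this correction step (all values below are invented purely for demonstration):

```python
import numpy as np

# Initial pose of object n as a 7-vector: [x, y, z, length, width, height, yaw].
initial_pose = np.array([12.0, 1.5, 30.0, 4.2, 1.8, 1.6, 0.10])
# Pose deviation output by the correction network for the same object.
pose_deviation = np.array([-0.3, 0.0, 0.8, 0.1, -0.05, 0.02, 0.04])

# Ln' = Ln + dLn, Dn' = Dn + dDn, Yn' = Yn + dYn
target_pose = initial_pose + pose_deviation
print(target_pose)  # corrected position, three-dimensional size, and direction angle
```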
According to the embodiments of the present disclosure, based on the initial pose information, the low-quality detection result is re-projected into the 2D image and visualized; that is, the labeling scene image is obtained by adding position labeling information associated with the initial pose information to the target object in the scene image to be processed. Specifically, the external frame of the target object is drawn in the scene image to be processed based on the initial pose information to obtain the labeling scene image.
The method of drawing the circumscribed frame of the target object in the image of the scene to be processed is further described below in conjunction with fig. 4.
Fig. 4 schematically illustrates a flow chart of processing a to-be-processed scene image to obtain an annotated scene image, according to an embodiment of the disclosure.
As shown in fig. 4, the method for processing the scene image to be processed to obtain the labeling scene image includes operations S401 to S403.
In operation S401, based on the initial pose information, the first position coordinates of a plurality of corner points of a circumscribed frame of the target object in the camera coordinate system are calculated. Knowing the initial position coordinates L of the center point of the object, the initial three-dimensional size D of the object (the length, width, and height of the circumscribed frame), and the initial direction angle Y of the target object, the position coordinates of the eight corner points of the circumscribed frame of the object can be conveniently calculated.
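A sketch of this corner computation is given below. It assumes the size is ordered as (length, width, height) and that the direction angle is a rotation about the vertical axis of the camera coordinate system; both conventions are assumptions made here for illustration, not fixed by this disclosure.

```python
import numpy as np

def box_corners_camera(center, size, yaw):
    """First position coordinates of the eight corner points of the circumscribed
    frame in the camera coordinate system, returned as a (3, 8) array."""
    x, y, z = center          # initial position coordinates L of the center point
    l, w, h = size            # initial three-dimensional size D
    # Corner offsets around the center before rotation.
    dx = np.array([1, 1, -1, -1, 1, 1, -1, -1]) * l / 2.0
    dy = np.array([1, -1, -1, 1, 1, -1, -1, 1]) * w / 2.0
    dz = np.array([1, 1, 1, 1, -1, -1, -1, -1]) * h / 2.0
    corners = np.stack([dx, dy, dz])          # shape (3, 8)
    # Rotation by the initial direction angle Y about the assumed vertical axis.
    c, s = np.cos(yaw), np.sin(yaw)
    rotation = np.array([[c, -s, 0.0],
                         [s, c, 0.0],
                         [0.0, 0.0, 1.0]])
    return rotation @ corners + np.array([[x], [y], [z]])
```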
In operation S402, coordinate transformation is performed on the first position coordinates of the plurality of corner points, so as to obtain second position coordinates of the plurality of corner points of the circumscribed frame of the target object in the image coordinate system.
In operation S403, an circumscribed frame of the target object is drawn in the scene image to be processed based on the second position coordinates of the plurality of corner points, to obtain a labeling scene image.
For example, the first position coordinates of the eight corner points are converted using the camera parameter matrix to obtain the second position coordinates of the eight corner points in the image coordinate system, and these second position coordinates are projected into the 2D image to obtain the annotation scene image in which the low-quality detection result is visualized.
The projection equation is shown in the following formula (1):
(u·t, v·t, t, 1)ᵀ = P · (x, y, z, 1)ᵀ    (1)
wherein x, y, z are the first position coordinates of eight corner points of the circumscribed frame in the camera coordinate system, and u, v are the second position coordinates of eight corner points of the circumscribed frame in the image coordinate system. P is a camera parameter matrix, and t is a projection vector parameter.
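The sketch below applies equation (1) to the eight corner points and then draws the frame edges into the image. It assumes the corner ordering of the earlier `box_corners_camera` sketch and uses OpenCV line drawing merely as one possible way of adding the visual labeling; P may be supplied as a 3x4 matrix or as the 4x4 homogeneous form, of which only the top three rows are used.

```python
import numpy as np
import cv2

def project_corners(corners_cam: np.ndarray, P: np.ndarray) -> np.ndarray:
    """Second position coordinates (u, v) of the eight corner points in the image."""
    P = np.asarray(P, dtype=float)[:3]               # top 3 rows of the camera parameter matrix
    ones = np.ones((1, corners_cam.shape[1]))
    uvt = P @ np.vstack([corners_cam, ones])          # (u*t, v*t, t) for each corner
    return uvt[:2] / uvt[2:3]

def draw_annotation(image: np.ndarray, corners_img: np.ndarray) -> np.ndarray:
    """Draw the circumscribed frame of the target object to obtain the annotated image."""
    annotated = image.copy()
    edges = [(0, 1), (1, 2), (2, 3), (3, 0),   # one face
             (4, 5), (5, 6), (6, 7), (7, 4),   # opposite face
             (0, 4), (1, 5), (2, 6), (3, 7)]   # connecting edges
    for a, b in edges:
        pt_a = tuple(int(v) for v in np.round(corners_img[:, a]))
        pt_b = tuple(int(v) for v in np.round(corners_img[:, b]))
        cv2.line(annotated, pt_a, pt_b, color=(0, 255, 0), thickness=2)
    return annotated
```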
According to the embodiments of the present disclosure, in this processing method, the first position coordinates of the corner points of the external frame of the target object in the camera coordinate system are calculated based on the initial pose information, and the low-quality detection result is then re-projected into the 2D image by coordinate conversion using the pre-calibrated camera parameter matrix, so that the initial detection result is visually labeled. The resulting labeled image can serve as input data for the correction network, which automatically evaluates and corrects the detection result to obtain a more accurate pose estimate.
According to an embodiment of the present disclosure, after obtaining the target pose information of the target object, the target detection method of the embodiment of the present disclosure further includes: and carrying out repeated iterative correction on the target pose information until a preset termination condition is reached, so as to obtain final-stage pose information.
Further, performing multiple iterative corrections on the target pose information specifically includes the following operations:
the method comprises the steps of performing first operation, namely performing ith iteration correction on ith-1 iteration target pose information (ith iteration initial pose information) based on ith iteration pose deviation to obtain ith iteration target pose information, wherein the ith iteration target pose information is used as the (i+1) iteration initial pose information.
And a second operation of determining an i+1th iteration pose deviation associated with the i-th iteration target pose information (i+1th iteration initial pose information).
And a third operation, iteratively performing: and (3) carrying out the (i+1) th iteration correction on the (i+1) th iteration target pose information (the (i+1) th iteration initial pose information) based on the (i+2) th iteration pose deviation, obtaining the (i+1) th iteration target pose information, determining the (i+2) th iteration pose deviation associated with the (i+2) th iteration target pose information (the (i+2) th iteration initial pose information) until the iteration pose deviation is small and a preset deviation threshold value is reached, and obtaining the finally determined final stage pose information.
The above specific method for performing each iteration operation in the iterative correction refers to operations S202 to S204 in the foregoing embodiments.
For example, operations S202 to S204 may be regarded as performing the 1st iteration: the 1st iteration initial pose information is corrected based on the pose deviation in operation S204 to obtain the target pose information of the target object, and this target pose information serves as the 1st iteration target pose information.
Next, the 2nd iteration pose deviation associated with the 1st iteration target pose information (the 2nd iteration initial pose information) is determined. Referring to operations S202 to S203, this may be: based on the 1st iteration target pose information, position labeling information is added to the target object in the scene image to be processed to obtain a labeling scene image 1; the labeling scene image 1 is input into the correction network, and the 2nd iteration pose deviation associated with the 1st iteration target pose information (the 2nd iteration initial pose information) is output.
The 2nd iteration correction is then performed on the 1st iteration target pose information (the 2nd iteration initial pose information) based on the 2nd iteration pose deviation to obtain the 2nd iteration target pose information.
The 3rd iteration pose deviation associated with the 2nd iteration target pose information (the 3rd iteration initial pose information) is determined.
The 3rd iteration correction is performed on the 2nd iteration target pose information (the 3rd iteration initial pose information) based on the 3rd iteration pose deviation to obtain the 3rd iteration target pose information.
The 4th iteration pose deviation associated with the 3rd iteration target pose information (the 4th iteration initial pose information) is determined, and so on. The iteration continues until the pose deviation of a certain iteration is smaller than the preset deviation threshold value, and the finally determined final-stage pose information is obtained.
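Sketched below with hypothetical helpers `annotate` (re-projects and draws the current estimate) and `correction_network` (returns the estimated deviation), the iteration stops once the estimated deviation falls below the preset deviation threshold or a maximum number of iterations is reached; the threshold and iteration cap are illustrative assumptions.

```python
import numpy as np

def iterative_correction(scene_image, initial_pose, annotate, correction_network,
                         deviation_threshold=1e-2, max_iterations=10):
    """Repeatedly re-annotate and correct the pose until the deviation is small enough."""
    pose = np.asarray(initial_pose, dtype=float)              # 7-vector: position, size, yaw
    for _ in range(max_iterations):
        annotated_image = annotate(scene_image, pose)         # re-project the current estimate
        deviation = np.asarray(correction_network(annotated_image), dtype=float)
        pose = pose + deviation                               # the i-th iteration correction
        if np.linalg.norm(deviation) < deviation_threshold:   # preset deviation threshold
            break
    return pose  # final-stage pose information
```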
According to the embodiment of the disclosure, through the operation of performing repeated iterative correction on the target pose information, a more accurate detection result can be obtained.
Based on the target detection method, the disclosure also provides a target detection device. The device will be described in detail below in connection with fig. 5.
Fig. 5 schematically shows a block diagram of a structure of an object detection apparatus according to an embodiment of the present disclosure.
As shown in fig. 5, the object detection device 500 of this embodiment includes an extraction module 501, a labeling module 502, a correction module 503, and a correction module 504.
The extracting module 501 is configured to extract initial pose information of a target object from a to-be-processed scene image, where the to-be-processed scene image is used to represent a surrounding driving environment of a target vehicle during a driving process, and the surrounding driving environment of the target vehicle includes at least one object.
The labeling module 502 is configured to obtain a labeled scene image after adding position labeling information to a target object in the scene image to be processed based on the initial pose information.
The correction module 503 is configured to input the annotation scene image into a correction network, and output pose deviation represented by position annotation information associated with the initial pose information in the annotation scene image.
And the correction module 504 is configured to correct the initial pose information based on the pose deviation, so as to obtain target pose information of the target object.
According to an embodiment of the present disclosure, in the above apparatus, the extraction module 501 outputs the initial three-dimensional pose information of the target object in the first stage by mining features of the image through a vision-based 3D target detection technique. The initial three-dimensional pose information is then further corrected in the second stage through the labeling module 502, the correction module 503, and the correction module 504. The scene image to be processed is a two-dimensional image; because a two-dimensional image lacks the three-dimensional geometric information of the scene (especially depth information), directly estimating the three-dimensional information of the object with a visual detection technique in the first stage through the extraction module 501 cannot accurately estimate the pose of the target object, so the resulting detection result is not accurate enough. For this reason, the embodiments of the present disclosure further correct the preliminarily extracted, low-accuracy three-dimensional pose information through the labeling module 502, the correction module 503, and the correction module 504, so that a more accurate detection result is obtained, the accuracy of object information detection is improved, and the unmanned vehicle achieves higher driving safety when realizing automatic driving based on the more accurate object information.
According to an embodiment of the present disclosure, the labeling module 502 includes a drawing unit, configured to draw a circumscribed frame of the target object in the to-be-processed scene image based on the initial pose information, so as to obtain the labeled scene image.
According to an embodiment of the present disclosure, the drawing unit includes a calculation subunit, a conversion subunit, and a drawing subunit.
The computing subunit is used for computing and obtaining first position coordinates of a plurality of corner points of an external frame of the target object in a camera coordinate system based on the initial pose information; the conversion subunit is used for carrying out coordinate conversion on the first position coordinates of the plurality of corner points to obtain second position coordinates of the plurality of corner points of the external frame of the target object in the image coordinate system; and the drawing subunit is used for drawing an external frame of the target object in the scene image to be processed based on the second position coordinates of the plurality of corner points to obtain the annotation scene image.
According to an embodiment of the present disclosure, wherein the initial pose information includes: initial position coordinates of a center point of the target object, an initial three-dimensional size of the target object, and an initial direction angle of the target object, wherein the initial direction angle is used for representing: an angle between the first travel direction of the target object and the second travel direction of the target vehicle.
According to an embodiment of the present disclosure, wherein the pose bias includes: position deviation of a center point of the target object, three-dimensional size deviation of the target object, and direction angle deviation of the target object; the target pose information includes: the target position coordinates of the center point of the target object, the target three-dimensional size of the target object, and the target direction angle of the target object.
The correction module comprises a first calculation unit, a second calculation unit and a third calculation unit.
The first calculation unit is used for calculating and obtaining a target position coordinate according to the initial position coordinate and the position deviation; the second calculation unit is used for calculating the target three-dimensional size according to the initial three-dimensional size and the three-dimensional size deviation; and the third calculation unit is used for calculating the target direction angle according to the initial direction angle and the direction angle deviation.
According to an embodiment of the present disclosure, the extraction module 501 includes a detection unit, configured to input a scene image to be processed into a target detection network, and output initial pose information of a target object.
According to an embodiment of the disclosure, the apparatus further includes an iteration module, configured to, after the target pose information of the target object is obtained, perform multiple iterative corrections on the target pose information until a preset termination condition is reached, so as to obtain the finally determined final-stage pose information.
According to an embodiment of the disclosure, the iteration module comprises a correction unit, a determination unit and an iteration unit.
The correction unit is used for performing the i-th iteration correction on the (i-1)-th iteration target pose information based on the i-th iteration pose deviation to obtain the i-th iteration target pose information, where the i-th iteration target pose information serves as the (i+1)-th iteration initial pose information; the determining unit is used for determining the (i+1)-th iteration pose deviation associated with the (i+1)-th iteration initial pose information; and the iteration unit is used for iteratively performing: carrying out the (i+1)-th iteration correction on the i-th iteration target pose information based on the (i+1)-th iteration pose deviation to obtain the (i+1)-th iteration target pose information, and determining the (i+2)-th iteration pose deviation associated with the (i+1)-th iteration target pose information, until the iteration pose deviation is smaller than a preset deviation threshold value, so as to obtain the finally determined final-stage pose information.
Any of the extraction module 501, the labeling module 502, the correction module 503, and the correction module 504 may be combined into one module for implementation, or any one of the modules may be split into a plurality of modules, according to embodiments of the present disclosure. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of other modules and implemented in one module. According to embodiments of the present disclosure, at least one of the extraction module 501, the labeling module 502, the correction module 503, and the correction module 504 may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system in a package, or an Application Specific Integrated Circuit (ASIC), or may be implemented by hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or by any one of, or a suitable combination of, the three implementation manners of software, hardware, and firmware. Alternatively, at least one of the extraction module 501, the labeling module 502, the correction module 503, and the correction module 504 may be at least partially implemented as a computer program module which, when executed, can perform the corresponding function.
Fig. 6 schematically illustrates a block diagram of an electronic device adapted to implement a target detection method according to an embodiment of the disclosure.
As shown in fig. 6, an electronic device 600 according to an embodiment of the present disclosure includes a processor 601 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. The processor 601 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. Processor 601 may also include on-board memory for caching purposes. The processor 601 may comprise a single processing unit or a plurality of processing units for performing different actions of the method flows according to embodiments of the disclosure.
In the RAM 603, various programs and data necessary for the operation of the electronic apparatus 600 are stored. The processor 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604. The processor 601 performs various operations of the method flow according to the embodiments of the present disclosure by executing programs in the ROM 602 and/or the RAM 603. Note that the program may be stored in one or more memories other than the ROM 602 and the RAM 603. The processor 601 may also perform various operations of the method flow according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the present disclosure, the electronic device 600 may also include an input/output (I/O) interface 605, which is also connected to the bus 604. The electronic device 600 may also include one or more of the following components connected to the input/output (I/O) interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), a speaker, and the like; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the input/output (I/O) interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is installed on the drive 610 as needed, so that a computer program read therefrom is installed into the storage section 608 as needed.
The present disclosure also provides a computer-readable storage medium that may be embodied in the apparatus/device/system described in the above embodiments; or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs which, when executed, implement methods in accordance with embodiments of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, the computer-readable storage medium may include ROM 602 and/or RAM 603 and/or one or more memories other than ROM 602 and RAM 603 described above.
Embodiments of the present disclosure also include a computer program product comprising a computer program containing program code for performing the methods shown in the flowcharts. The program code, when executed in a computer system, causes the computer system to implement the object detection methods provided by embodiments of the present disclosure.
The above-described functions defined in the system/apparatus of the embodiments of the present disclosure are performed when the computer program is executed by the processor 601. The systems, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.
In one embodiment, the computer program may be carried on a tangible storage medium such as an optical storage device, a magnetic storage device, or the like. In another embodiment, the computer program may also be transmitted and distributed in the form of a signal over a network medium, downloaded and installed via the communication section 609, and/or installed from the removable medium 611. The computer program may include program code that may be transmitted using any appropriate network medium, including but not limited to: wireless, wired, and the like, or any suitable combination of the foregoing.
According to embodiments of the present disclosure, program code for carrying out the computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages; in particular, such computer programs may be implemented in high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. Programming languages include, but are not limited to, Java, C++, Python, the "C" language, or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, partly on a remote computing device, or entirely on the remote computing device or server. Where a remote computing device is involved, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, via the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that the features recited in the various embodiments of the present disclosure and/or in the claims may be combined and/or sub-combined in a variety of ways, even if such combinations or sub-combinations are not explicitly recited in the present disclosure. In particular, the features recited in the various embodiments of the present disclosure and/or the claims may be combined and/or sub-combined in various ways without departing from the spirit and teachings of the present disclosure. All such combinations and/or sub-combinations fall within the scope of the present disclosure.
The embodiments of the present disclosure are described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described above separately, this does not mean that the measures in the embodiments cannot be used advantageously in combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be made by those skilled in the art without departing from the scope of the disclosure, and such alternatives and modifications are intended to fall within the scope of the disclosure.

Claims (12)

1. A target detection method comprising:
extracting initial pose information of a target object from a scene image to be processed, wherein the scene image to be processed is used for representing the surrounding driving environment of a target vehicle in the driving process, and the surrounding driving environment of the target vehicle comprises at least one object;
based on the initial pose information, adding position annotation information to the target object in the scene image to be processed to obtain an annotation scene image;
inputting the annotation scene image into a correction network, and outputting pose deviation represented by the position annotation information associated with the initial pose information in the annotation scene image;
and correcting the initial pose information based on the pose deviation to obtain target pose information of the target object.
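By way of illustration only and not as claim language, the following minimal Python sketch shows the detect, annotate, and correct flow recited in claim 1; the networks are passed in as generic callables, and every name (detection_net, correction_net, draw_annotation) is a hypothetical stand-in rather than anything defined in the disclosure:

```python
import numpy as np

def detect_and_refine(scene_image, detection_net, correction_net, draw_annotation):
    """One pass of the method of claim 1, with all callables supplied by the caller."""
    # Step 1: extract initial pose information of the target object,
    # e.g. a flat vector [cx, cy, cz, l, w, h, yaw].
    initial_pose = detection_net(scene_image)

    # Step 2: add position annotation information (e.g. a projected 3D box)
    # to the scene image to obtain the annotation scene image.
    annotation_image = draw_annotation(scene_image, initial_pose)

    # Step 3: the correction network reads the annotated image and outputs the
    # pose deviation implied by how well the annotation fits the target object.
    pose_deviation = correction_net(annotation_image)

    # Step 4: correct the initial pose with the deviation to obtain the target pose.
    return np.asarray(initial_pose) + np.asarray(pose_deviation)
```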
2. The method of claim 1, wherein adding the position annotation information to the target object in the scene image to be processed based on the initial pose information to obtain the annotation scene image comprises:
and drawing an external frame of the target object in the scene image to be processed based on the initial pose information to obtain the annotation scene image.
3. The method of claim 2, wherein drawing the external frame of the target object in the scene image to be processed based on the initial pose information to obtain the annotation scene image comprises:
calculating, based on the initial pose information, first position coordinates of a plurality of corner points of the external frame of the target object in a camera coordinate system;
performing coordinate transformation on the first position coordinates of the plurality of corner points to obtain second position coordinates of the plurality of corner points of the external frame of the target object in an image coordinate system;
and drawing an external frame of the target object in the scene image to be processed based on the second position coordinates of the plurality of corner points, so as to obtain the annotation scene image.
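As an illustrative aside rather than claim language, the two steps of claim 3, computing the corner points of the box in the camera coordinate system and transforming them into the image coordinate system, can be sketched as below; the axis convention (y vertical, yaw about the y axis) and the pinhole intrinsic matrix K are assumptions, not details taken from the disclosure:

```python
import numpy as np

def project_box_corners(center, size, yaw, K):
    """Project the eight corners of an oriented 3D box (camera frame) into the image."""
    l, w, h = size
    # Eight corners in the object's local frame (assumed axis convention).
    x = np.array([ 1,  1,  1,  1, -1, -1, -1, -1]) * (l / 2)
    y = np.array([ 1,  1, -1, -1,  1,  1, -1, -1]) * (h / 2)
    z = np.array([ 1, -1,  1, -1,  1, -1,  1, -1]) * (w / 2)
    corners = np.stack([x, y, z])                      # shape (3, 8)

    # First position coordinates: rotate by the direction angle and translate
    # to the box centre, giving corner coordinates in the camera coordinate system.
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[ c, 0.0,  s],
                  [0.0, 1.0, 0.0],
                  [-s, 0.0,  c]])
    corners_cam = R @ corners + np.asarray(center, dtype=float).reshape(3, 1)

    # Second position coordinates: perspective projection with the intrinsics K
    # yields pixel coordinates in the image coordinate system.
    uvw = K @ corners_cam
    return (uvw[:2] / uvw[2]).T                        # shape (8, 2)
```

Drawing line segments between these eight projected corners (for example with OpenCV's cv2.line) would then produce the annotation scene image of claim 2.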
4. A method according to any one of claims 1-3, wherein:
the initial pose information includes: the initial position coordinates of the center point of the target object, the initial three-dimensional size of the target object, and the initial direction angle of the target object, wherein the initial direction angle is used for representing: an angle between the first travel direction of the target object and the second travel direction of the target vehicle.
5. The method according to claim 4, wherein:
the pose deviation includes: the position deviation of the center point of the target object, the three-dimensional size deviation of the target object and the direction angle deviation of the target object;
the target pose information includes: the target position coordinates of the center point of the target object, the target three-dimensional size of the target object and the target direction angle of the target object;
wherein correcting the initial pose information based on the pose deviation to obtain the target pose information of the target object comprises:
calculating the target position coordinates according to the initial position coordinates and the position deviation;
calculating the target three-dimensional size according to the initial three-dimensional size and the three-dimensional size deviation;
and calculating the target direction angle according to the initial direction angle and the direction angle deviation.
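Purely to illustrate the arithmetic of claim 5 (the claim only says "calculating according to"; treating each deviation as an additive residual is an assumption), a small sketch:

```python
def correct_pose(initial, deviation):
    """Combine the initial pose information with the predicted pose deviation.

    Both arguments are dicts with hypothetical keys: 'center' -> (x, y, z),
    'size' -> (l, w, h), and 'yaw' -> direction angle in radians.
    """
    target_center = tuple(c + d for c, d in zip(initial["center"], deviation["center"]))
    target_size = tuple(s + d for s, d in zip(initial["size"], deviation["size"]))
    target_yaw = initial["yaw"] + deviation["yaw"]
    return {"center": target_center, "size": target_size, "yaw": target_yaw}
```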
6. The method of claim 1, wherein extracting the initial pose information of the target object from the scene image to be processed comprises:
inputting the scene image to be processed into a target detection network, and outputting initial pose information of the target object.
7. The method of claim 1, further comprising, after obtaining target pose information for the target object:
and carrying out repeated iterative correction on the target pose information until a preset termination condition is reached, so as to obtain final-stage pose information.
8. The method of claim 7, wherein performing the iterative correction on the target pose information multiple times until the preset termination condition is reached to obtain the final-stage pose information comprises:
performing an i-th iteration correction on (i-1)-th iteration target pose information based on an i-th iteration pose deviation to obtain i-th iteration target pose information, wherein the i-th iteration target pose information is used as (i+1)-th iteration initial pose information;
determining an (i+1)-th iteration pose deviation associated with the (i+1)-th iteration initial pose information;
and iteratively performing: carrying out an (i+1)-th iteration correction on the (i+1)-th iteration initial pose information based on the (i+1)-th iteration pose deviation to obtain (i+1)-th iteration target pose information, and determining an (i+2)-th iteration pose deviation associated with the (i+1)-th iteration target pose information, until the (i+1)-th iteration pose deviation is smaller than a preset deviation threshold value, so as to obtain the finally determined final-stage pose information.
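For illustration only, a minimal sketch of the iterative refinement of claims 7 and 8, assuming the pose and the deviation are flat numeric vectors and that the Euclidean norm of the deviation is compared against the preset threshold; the names correction_net and draw_annotation, and the max_iters safeguard, are hypothetical:

```python
import numpy as np

def iterative_refine(scene_image, pose, correction_net, draw_annotation,
                     threshold=1e-3, max_iters=10):
    """Repeat annotate -> predict deviation -> correct until the deviation is small."""
    pose = np.asarray(pose, dtype=float)
    for _ in range(max_iters):
        annotated = draw_annotation(scene_image, pose)             # re-draw the box at the current pose
        deviation = np.asarray(correction_net(annotated), dtype=float)
        pose = pose + deviation                                    # i-th target pose = (i+1)-th initial pose
        if np.linalg.norm(deviation) < threshold:                  # preset deviation threshold reached
            break
    return pose                                                    # final-stage pose information
```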
9. An object detection apparatus comprising:
the extraction module is used for extracting initial pose information of a target object from a scene image to be processed, wherein the scene image to be processed is used for representing the surrounding driving environment of a target vehicle in the driving process, and the surrounding driving environment of the target vehicle comprises at least one object;
the labeling module is used for adding position labeling information to the target object in the scene image to be processed based on the initial pose information to obtain a labeled scene image;
the determination module is used for inputting the annotation scene image into a correction network and outputting the pose deviation represented by the position annotation information associated with the initial pose information in the annotation scene image;
and the correction module is used for correcting the initial pose information based on the pose deviation to obtain the target pose information of the target object.
10. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-8.
11. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method according to any of claims 1-8.
12. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 8.
CN202310318803.0A 2023-03-29 2023-03-29 Target detection method and device, electronic equipment and storage medium Pending CN116343153A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310318803.0A CN116343153A (en) 2023-03-29 2023-03-29 Target detection method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310318803.0A CN116343153A (en) 2023-03-29 2023-03-29 Target detection method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116343153A true CN116343153A (en) 2023-06-27

Family

ID=86882006

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310318803.0A Pending CN116343153A (en) 2023-03-29 2023-03-29 Target detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116343153A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination