CN111753739A - Object detection method, device, equipment and storage medium - Google Patents

Object detection method, device, equipment and storage medium Download PDF

Info

Publication number
CN111753739A
Authority
CN
China
Prior art keywords
image
map
position information
initial
dimensional position
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010593140.XA
Other languages
Chinese (zh)
Other versions
CN111753739B (en)
Inventor
周定富
宋希彬
卢飞翔
方进
张良俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010593140.XA priority Critical patent/CN111753739B/en
Publication of CN111753739A publication Critical patent/CN111753739A/en
Application granted granted Critical
Publication of CN111753739B publication Critical patent/CN111753739B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/003D [Three Dimensional] image rendering
    • G06T15/005General purpose rendering architectures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/003D [Three Dimensional] image rendering
    • G06T15/04Texture mapping
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features

Abstract

The application discloses an object detection method, an object detection device, equipment and a storage medium, and relates to the technical fields of artificial intelligence, object detection, deep learning, neural networks, automatic driving, unmanned driving, auxiliary robots, virtual reality, augmented reality and the like. The object detection method comprises the following steps: detecting an image to obtain initial three-dimensional position information, a first map and a model category of an object in the image; projecting by using the initial three-dimensional position information of the object and the model category to obtain a second map; and correcting the initial three-dimensional position information of the object by using the first map and the second map. According to the method and the device, the initial three-dimensional position information can be corrected by using the model category of the object, and a more accurate three-dimensional position of the object can be obtained.

Description

Object detection method, device, equipment and storage medium
Technical Field
The application relates to the field of image processing, in particular to the technical fields of artificial intelligence, object detection, deep learning, neural networks, automatic driving, unmanned driving, auxiliary robots, virtual reality, augmented reality and the like.
Background
In existing detection technology, a three-dimensional object is often described as a generalized three-dimensional bounding box. The three-dimensional object detection problem is then to regress the numerical values of the three-dimensional bounding box from image information. A number of related three-dimensional detection methods have been proposed based on this idea. Such methods treat vehicle detection as a regression problem, and the calculation process is complicated.
Disclosure of Invention
The application provides an object detection method, device, equipment and storage medium.
According to a first aspect of the present application, there is provided an object detection method comprising:
detecting the image to obtain initial three-dimensional position information, a first map and a model category of an object in the image;
projecting by using the initial three-dimensional position information of the object and the model category to obtain a second map;
and correcting the initial three-dimensional position information of the object by using the first map and the second map.
According to a second aspect of the present application, there is provided an object detecting apparatus comprising:
the detection module is used for detecting the image to obtain initial three-dimensional position information, a first map and a model category of an object in the image;
the projection module is used for projecting by utilizing the initial three-dimensional position information and the model type of the object to obtain a second map;
and the correction module is used for correcting the initial three-dimensional position information of the object by utilizing the first map and the second map.
According to a third aspect of the present application, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the object detection method of any one of the embodiments of the aspects described above.
According to a fourth aspect of the present application, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the object detection method in any one of the above-described aspects.
According to the technical scheme, the initial three-dimensional position information can be corrected by utilizing the model type of the object, and a more accurate three-dimensional position of the object can be obtained.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present application, nor do they limit the scope of the present application. Other features of the present application will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is a flow chart of an object detection method according to an embodiment of the present application;
FIG. 2 is a schematic view of a UV map;
FIG. 3 is a flow chart of an object detection method according to another embodiment of the present application;
FIG. 4 is a flow chart of an object detection method according to another embodiment of the present application;
FIG. 5 is a flow chart of an object detection method according to another embodiment of the present application;
FIG. 6 is a schematic diagram of the relationship of the camera coordinate system to the road surface coordinate system;
FIG. 7 is a flow chart of an object detection method according to another embodiment of the present application;
FIG. 8 is a flow chart of one example of three-dimensional object detection based on a single frame image;
FIG. 9 is a schematic diagram of estimating a contact point O' of a center point of a vehicle with a ground surface;
FIGS. 10a, 10b and 10c are schematic diagrams of predicted UV maps;
FIG. 11 is a block diagram of an object detection device according to an embodiment of the present application;
FIG. 12 is a block diagram of an object detection device according to another embodiment of the present application;
fig. 13 is a block diagram of an electronic device of an object detection method according to an embodiment of the present application.
Detailed Description
The following description of the exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments of the application for the understanding of the same, which are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a flow chart of an object detection method according to an embodiment of the present application, which may include:
s101, detecting the image to obtain initial three-dimensional position information, a first map and a model type of the object in the image.
And S102, projecting by using the initial three-dimensional position information and the model type of the object to obtain a second map.
S103, correcting the initial three-dimensional position information of the object by using the first map and the second map.
The image in the embodiment of the present application may include a frame image in a video, a captured photograph, and the like, for example, a frame image in a video taken by a vehicle-mounted camera (which may also be referred to as a video camera) or a photograph taken with a mobile phone, and various types of obstacles may be included in the image. There are various methods of object detection. For example, an artificial intelligence algorithm such as a neural network can be used for training to obtain a detection model capable of identifying objects. The image is detected by using the object detection model to obtain a two-dimensional detection frame of the object in the image, and the position information of the two-dimensional detection frame may include coordinates of the two-dimensional detection frame where the object is located, for example, the coordinates of the upper left corner and the coordinates of the lower right corner. In addition, the three-dimensional position of the object in the image can be predicted by using the object detection model, and the initial three-dimensional position information of each object in the image can be estimated. The initial three-dimensional position information of the object may include at least one of the following: the size of the object, the three-dimensional coordinates of the center point, the orientation angle, and the like. The model for performing the two-dimensional detection and the model for performing the three-dimensional detection may be the same model or different models.
In S101, the image may be UV-segmented by the object detection model to obtain a map of the object. The map of the object may be a UV map. The UV map may include a planar representation of the surface of the three-dimensional model. As shown in fig. 2, as an example of a UV map, UV segmentation is performed on the three-dimensional vehicle model in the left image, and UV segmentation results on the two-dimensional plane in the right image can be obtained; these UV segmentation results can be referred to as UV maps. The UV map may include the (u, v) value corresponding to each pixel position (x, y) on the two-dimensional plane. The UV map may establish a correspondence between two-dimensional image pixel points and the three-dimensional object model.
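For illustration only (not part of the original description), the following Python sketch shows one way such a UV map could be represented in memory, assuming per-pixel (u, v) values together with a foreground mask; all array names are assumptions.

```python
import numpy as np

# A minimal sketch of a UV map for an H x W image: every pixel stores a (u, v)
# value, and a boolean mask marks the pixels that belong to the detected object.
H, W = 480, 640
uv_map = np.zeros((H, W, 2), dtype=np.float32)   # channel 0: u, channel 1: v
mask = np.zeros((H, W), dtype=bool)              # True where the object is present

# Look up the (u, v) value associated with image pixel (x, y).
x, y = 320, 240
if mask[y, x]:
    u, v = uv_map[y, x]
```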
In S102, a second map may be obtained by performing projection using the initial three-dimensional position information and the model category of the object detected from the image, together with the internal parameters of the camera. The second map may also be a UV map. The initial three-dimensional position information of the object may include the three-dimensional coordinates (X_c, Y_c, Z_c) of the center point of the object, the orientation of the object, and the like.
In addition, there are strong a priori information about objects in many scenes in real life, for example, these a priori information include but are not limited to: the shape of the object, the size of the object, the properties of the object, and where the object may appear, etc. For example, for vehicle detection problems in an autonomous driving scenario, the models of the vehicles are not very different, nor are the sizes very different. It is considered that these prior information can be utilized in object detection, and the problem of object detection can be simplified.
In the embodiment of the present application, three-dimensional models of the various categories of objects may be established in advance. For example, in intelligent transportation technology, such as in automatic driving, unmanned driving and assisted driving scenes, three-dimensional models of vehicles of various types, such as a car model, a van model, an SUV (Sport Utility Vehicle, or off-road vehicle) model, a bus model and the like, can be established in advance. In the auxiliary robot scene, the three-dimensional position information of objects in the surrounding scene can be acquired by using image information, so as to assist the robot in avoiding obstacles and grasping objects. In a virtual reality or augmented reality scene, the three-dimensional information of an object is restored from an image, so that a virtual object can be placed in a real scene.
During data annotation, each object in the sample image may be annotated with a category: for example, 0 represents a car, 1 represents an SUV, 2 represents a minibus, etc. The labeled data sets are then used to train an object detection model using a neural network, so that the object detection model can be used to identify the category of the vehicle. If the model category of object A is recognized to be a car, the car model corresponding to that category can be found.
Further, a three-dimensional model built in advance may be acquired according to the model category, and the previously calculated three-dimensional coordinates (X_c, Y_c, Z_c) of the center point of the object and the orientation of the object may be used as the coordinates and orientation of the center point of the three-dimensional model; the three-dimensional model, such as a sedan model, is then rendered and projected onto the image to obtain a projected second map. The second map may be a UV map. The second map may be the same size as the original image and the first map, and each two-dimensional pixel coordinate (x, y) of the second map has a corresponding (u, v) coordinate.
Illustratively, the correspondence of the three-dimensional point coordinates of the three-dimensional model to the UV Map (U-V-Map) can be obtained by the following steps.
The first step: a correspondence between the three-dimensional model of the object and the UV map (U-V-Map) is established. This correspondence may be established by standard UV mapping (U-V-mapping). After the correspondence is completed, the following relationship is obtained: each three-dimensional point (X_m, Y_m, Z_m) on the three-dimensional model corresponds to U and V coordinates on the UV map, thus yielding (u, v). Furthermore, the value of (u, v) in the UV map may also correspond to a component attribute of this three-dimensional point, such as a door, a tail, and the like.
The second step: a correspondence between the three-dimensional model points (X_m, Y_m, Z_m), for example the three-dimensional coordinate points of the vehicle model, and the two-dimensional points (x, y) of the image coordinate system is established. This correspondence is established through a camera projection process, as in the following formula (1):
(x, y, 1) = K * [R, T] * [X_m, Y_m, Z_m, 1]^T / Z    (1),
wherein K = [fx, 0, c_x; 0, fy, c_y; 0, 0, 1] contains the internal parameters of the camera, fx and fy are the focal lengths of the camera on the X axis and the Y axis, respectively, and c_x and c_y represent the translation of the origin of the camera coordinate system; [R, T] is the projection matrix of the object in the camera coordinate system (containing position and orientation information), and ^T denotes transposition. This is the standard camera perspective projection equation.
The above steps establish the relationship between a point (x, y) of the two-dimensional image coordinate system and a point (X_m, Y_m, Z_m) of the three-dimensional object coordinate system. Meanwhile, the correspondence between the three-dimensional object coordinate system (X_m, Y_m, Z_m) and the UV coordinate system has been determined in advance. Thus, using the three-dimensional coordinate points (X_m, Y_m, Z_m) of the model as the intermediate link, the (u, v) value corresponding to each image pixel point (x, y) can be obtained. This yields the second UV map.
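As an illustration of this projection step, the following Python sketch builds a second UV map by pushing the model vertices through equation (1); it is only an assumption of how this could be done (a practical implementation would rasterize the model's triangles with a renderer), and the names render_uv_map, verts and vert_uv are illustrative.

```python
import numpy as np

def render_uv_map(verts, vert_uv, K, R, T, height, width):
    """Project model vertices (X_m, Y_m, Z_m) into the image with equation (1)
    and splat their pre-assigned (u, v) values into a dense UV map."""
    cam = (R @ verts.T).T + T                  # model coordinates -> camera coordinates
    z = cam[:, 2]
    pix = (K @ cam.T).T                        # apply the intrinsic matrix K
    x = pix[:, 0] / z                          # perspective division by Z
    y = pix[:, 1] / z

    uv_map = np.zeros((height, width, 2), dtype=np.float32)
    depth = np.full((height, width), np.inf)   # keep only the nearest surface point per pixel
    for xi, yi, zi, uv in zip(x.astype(int), y.astype(int), z, vert_uv):
        if 0 <= xi < width and 0 <= yi < height and 0 < zi < depth[yi, xi]:
            depth[yi, xi] = zi
            uv_map[yi, xi] = uv
    return uv_map
```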
In the embodiment of the application, the initial three-dimensional position information can be corrected by using the prior information of the object, such as the model category of the vehicle, so that a more accurate three-dimensional position of the object can be obtained.
Fig. 3 is a flow chart of an object detection method according to another embodiment of the present application. The same description of this embodiment as that of the previous embodiment has the same meaning, and is not repeated here.
On the basis of the foregoing embodiment, in a possible implementation manner, in S101, detecting the image to obtain initial three-dimensional position information of the object in the image includes:
s201, carrying out object detection on the first image to obtain a two-dimensional detection frame of the object; in this step, a first map of the object may also be obtained.
S202, acquiring a second image including the object from the first image by using the two-dimensional detection frame of the object.
S203, predicting the first image and the second image by utilizing a neural network to obtain a prediction result, wherein the prediction result comprises a junction point of the central point of the object and the ground. In this step, the obtained prediction result may further include a model category of the object.
And S204, calculating initial three-dimensional position information of the central point of the object by utilizing the intersection point of the central point of the object and the ground.
If the first image comprises a plurality of objects, the first image is detected through the object detection model, and the two-dimensional detection frames of the plurality of objects can be obtained. Then, a second image is respectively cropped from the first image by using the two-dimensional detection frame of each object. For example, if the first image includes an object A and an object B, a second image including object A and a second image including object B can be cropped from the first image.
For each object, the original first image may be merged with a second image comprising the object using a feature layer of the neural network, respectively. For example, combining the first image with the features of the second image including the object a results in one feature map, and combining the first image with the features of the second image including the object B results in another feature map. And then inputting the combined feature map into a full-connection layer of the neural network, wherein the obtained prediction result can comprise coordinates of the intersection point of the central point of the object and the ground.
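The following PyTorch sketch illustrates this merge-and-predict step under the assumptions stated in the comments; it is not the patent's exact network, and all names and layer sizes are illustrative.

```python
import torch
import torch.nn as nn

# A minimal sketch of merging the features of the full first image with those of
# the cropped second image and predicting the ground contact point, the projected
# center point, and the model category through fully connected layers. Both feature
# maps are assumed to share the same spatial size H x W.
class ProposalHead(nn.Module):
    def __init__(self, c1, c2, num_classes, hidden=256):
        super().__init__()
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(c1 + c2, hidden), nn.ReLU(),
            nn.Linear(hidden, 4 + num_classes),
        )

    def forward(self, feat_full, feat_crop):
        # Merge at the feature layer: W x H x C1 and W x H x C2 -> W x H x (C1 + C2).
        merged = torch.cat([feat_full, feat_crop], dim=1)
        out = self.head(merged)
        contact_xy = out[:, 0:2]   # image coordinates of O', the center-ground contact point
        center_xy = out[:, 2:4]    # projection O of the 3D center point on the image
        cls_logits = out[:, 4:]    # model-category logits
        return contact_xy, center_xy, cls_logits
```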
In the embodiment of the application, the initial three-dimensional position information of an object is predicted by using an original image (a first image) and an image (a second image) including the object and obtained by cutting from the original image, and the prediction result is more accurate.
Fig. 4 is a flow chart of an object detection method according to another embodiment of the present application. The same description of this embodiment as that of the above embodiment has the same meaning and is not repeated here.
On the basis of any of the foregoing embodiments, in a possible implementation manner, in S203, predicting the first image and the second image by using a neural network to obtain a prediction result, including:
s301, acquiring the characteristics of the first image and the characteristics of the second image;
s302, inputting the characteristics of the first image and the second image into a characteristic layer of the neural network for combination;
and S303, inputting the combined features into a plurality of full-connection layers of the neural network for prediction to obtain a prediction result.
Illustratively, if a second image including object A and a second image including object B are derived from the original first image, object A and object B are predicted separately. For object A, after the first image and the second image including object A are merged by using the neural network, the intersection point O'_A of the center point of object A (for example, the center point of the object can be represented by three-dimensional coordinates) with the ground and the projection point O_A of the center point of object A on the two-dimensional image are predicted, and the model category of object A can also be predicted. For object B, after the first image and the second image including object B are merged by using the neural network, the intersection point O'_B of the center point of object B with the ground and the projection point O_B of the center point of object B on the two-dimensional image are obtained by prediction, and the model category of object B can also be predicted.
Then, using the intersection point of the center point of object A with the ground, the initial three-dimensional position information of the center point of object A can be calculated. Similarly, the initial three-dimensional position information of the center point of object B is calculated by using the intersection point of the center point of object B with the ground. The initial three-dimensional position information of the center point of object A and/or object B calculated here can be expressed by using three-dimensional coordinates.
In the embodiment of the application, after the features of the original image (the first image) and the detected features of the image (the second image) including the object are combined in the neural network, the predicted three-dimensional position coordinate of the object is more accurate, so that the number of subsequent corrections is reduced, and the calculation amount is reduced.
Fig. 5 is a flow chart of an object detection method according to another embodiment of the present application. The same description of this embodiment as that of the above embodiment has the same meaning and is not repeated here.
Based on any of the above embodiments, in a possible implementation manner, in S204, calculating initial three-dimensional position information of the center point of the object by using an intersection point of the center point of the object and the ground includes:
s401, obtaining the distance between the camera and an object by using the normal vector of the height of the camera and the ground;
s402, calculating initial three-dimensional position information of the central point of the object by using the distance and the intersection point of the central point of the object and the ground.
For example, referring to fig. 6, if the vehicle is driven on a relatively flat road surface, the camera height (Camera Height) of the camera that captured the image is h, and the normal vector (Normal Vector) of the ground in the camera coordinate system is n = (n_x, n_y, n_z)^T.
The distance Z below can also be understood as the distance from the device carrying the camera, for example a vehicle on which the camera is mounted, to the object.
Any point G = (X, Y, Z)^T on the ground plane satisfies formula (2):
(n_x, n_y, n_z) * (X, Y, Z)^T = h    (2)
Assuming that the road surface is a plane, n_x and n_z are both equal to 0 and the value of h is a known quantity, so the value of Z can be calculated by combining equation (2) with equations (3) and (4) below.
In the camera coordinate system, the relationship between the three-dimensional point coordinates (X, Y, Z) and the corresponding image point coordinates (x, y) satisfies the following equations (3) and (4):
X = (x - u0) / fx * Z    (3),
Y = (y - v0) / fy * Z    (4),
where (u0, v0) is the optical center position of the camera, and fx, fy are the focal lengths of the camera on the X-axis and Y-axis, respectively. In most cases, fx and fy are represented by a uniform focal length f.
By combining equations (2) - (4), the three-dimensional coordinates (X, Y, Z) of any ground point can be solved.
Also, if the intersection point O' of the object with the ground has been estimated in step S203 described above, the three-dimensional position information corresponding to O' can be calculated by using the coordinates (x, y) of O' in the image in combination with equations (2)-(4), and the calculation result can be expressed as the three-dimensional coordinates (X_c, Y_c, Z_c) of the center point of the object.
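For illustration, the following Python sketch applies equations (2)-(4) to recover the three-dimensional point on the ground plane that projects to a given image point; the function name and argument names are assumptions. Applied to the image coordinates of O', it yields the initial center-point coordinates (X_c, Y_c, Z_c) described above.

```python
def backproject_ground_point(x, y, h, n, fx, fy, u0, v0):
    """Solve equations (2)-(4) for the ground point imaged at pixel (x, y),
    given camera height h, ground normal n = (n_x, n_y, n_z) in camera
    coordinates, and the camera intrinsics fx, fy, u0, v0."""
    nx, ny, nz = n
    # Substitute X = (x - u0) / fx * Z and Y = (y - v0) / fy * Z into
    # n . (X, Y, Z) = h and solve for Z.
    Z = h / (nx * (x - u0) / fx + ny * (y - v0) / fy + nz)
    X = (x - u0) / fx * Z
    Y = (y - v0) / fy * Z
    return X, Y, Z
```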
In the embodiment of the application, the initial three-dimensional position information of the center point of the object can be rapidly and accurately calculated by using the parameters of the camera, such as its height, together with the intersection point of the center point of the object with the ground. These three-dimensional coordinates are then used as the initial three-dimensional position information of the object, serving as the initial value for further optimization.
Fig. 7 is a flow chart of an object detection method according to another embodiment of the present application. The same description of this embodiment as that of the above embodiment has the same meaning and is not repeated here.
In a possible implementation manner based on any one of the above embodiments, in S103, correcting the initial three-dimensional position information of the object by using the first map and the second map includes:
s501, establishing a loss function by using the first map and the second map;
and S502, correcting the initial three-dimensional position information of the object by using the loss function.
The three-dimensional position of the object may include coordinates, an orientation angle, and the like of a center point of a three-dimensional detection frame of the object.
The loss function may be calculated from the difference of the UV coordinates of the first and second maps. For example, each pixel point (x, y) of the first map has a corresponding (u, v) coordinate, and each pixel point (x, y) of the second map has a corresponding (u', v') coordinate. The value of the loss function can be calculated using the difference between the (u, v) coordinates of the pixel at the same position (x, y) in the first map and the (u', v') coordinates in the second map. For example, the loss function may be equal to the sum of the differences of the U, V or UV coordinates of the pixels at all positions in the two maps.
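A minimal sketch of such a loss, assuming both maps store per-pixel (u, v) values as numpy arrays and that a foreground mask restricts the comparison to pixels covered by the object; the names are assumptions.

```python
import numpy as np

def uv_loss(first_map, second_map, mask):
    """Sum of per-pixel UV differences |u - u'| + |v - v'| over masked pixels."""
    diff = np.abs(first_map - second_map)
    return diff[mask].sum()
```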
And correcting the initial three-dimensional position information of the object by using the loss function to obtain corrected position information.
Illustratively, the correction process may include adjusting the position and orientation information of the object in the initial projection matrix R, T of the camera coordinate system, and then substituting the new R, T into equation (1) above to obtain the (u', v') coordinates of a new second map, i.e., a new UV map. The new UV map and the previous initial UV map are each substituted into the loss function, and the change in the value of the loss function is compared. If the loss function value becomes smaller, the (u', v') coordinates of the new UV map are taken as the coordinates (u, v) of the initial UV map for the next optimization. If the loss function value becomes larger, the initial UV map is kept unchanged. This is repeated until the value of the loss function meets the requirement, for example, is less than a certain threshold, and the corrected three-dimensional position information of the object is finally obtained.
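The iterative correction described above could look like the following sketch, reusing the render_uv_map and uv_loss helpers sketched earlier; perturb_pose is a hypothetical routine that proposes small changes to the object's position and orientation, so this is an illustration of the accept-if-the-loss-decreases loop rather than a definitive implementation.

```python
def refine_pose(first_map, mask, verts, vert_uv, K, R, T, height, width,
                steps=100, threshold=1e-3):
    """Adjust R, T so that the rendered UV map better matches the segmented one."""
    best = uv_loss(first_map, render_uv_map(verts, vert_uv, K, R, T, height, width), mask)
    for _ in range(steps):
        R_new, T_new = perturb_pose(R, T)          # hypothetical small pose update
        candidate = render_uv_map(verts, vert_uv, K, R_new, T_new, height, width)
        loss = uv_loss(first_map, candidate, mask)
        if loss < best:                            # keep the update only if the loss decreases
            best, R, T = loss, R_new, T_new
        if best < threshold:                       # stop once the loss meets the requirement
            break
    return R, T
```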
In the embodiment of the application, the loss function is established by utilizing the predicted second map and the segmented first map, and the initial three-dimensional position information of the initial object is corrected, so that more accurate three-dimensional position information of the object can be obtained.
In one application example, the three-dimensional position information of an object can be detected by an end-to-end three-dimensional object detection algorithm based on a single frame image and a three-dimensional model. Specifically, the detection process may include the following steps:
s1: inputting a single frame image into the object detection model, and acquiring a two-dimensional detection frame of the object to be detected and a UV segmentation result corresponding to each object to be detected by using a detection algorithm (such as Mask-RCNN). For example, as shown in fig. 8, UV-Seg represents a UV segmentation result, which can be represented by a UV Map (UV-Map); bboxes represent the object frame detection results. If multiple objects are included in the image, each object may have a corresponding object frame, such as a two-dimensional detection frame. Further, an image including the object can be cut out from the original image using the two-dimensional detection frame.
S2: and respectively extracting the characteristics in the whole image and the cut image only containing the object by using a deep learning network (such as Res-Net 50). Referring to fig. 8, by detecting the image by using the object detection model, the Features of the original image and the Features of cropping (Cropped Features) can also be obtained. The two are merged at the Feature layer to obtain a Shared Feature Map (Shared Feature Map). For example, there are three vehicles in fig. 8, and an image including each vehicle may be cut out and combined with the original image to obtain three combined images. For example, the feature of the original image and the feature of the cropping are respectively represented by FallAnd FobjectAnd (4) showing. F is to beallAnd FobjectMerging is performed at the feature layer. Characteristic FallComprises W × H × C1 and characteristic FobjectComprises W × H × C2, which is combined to become W × H × (C1+ C2).
Then, the prediction (proposal) result of the object is output through two (or more) layers of the convolutional neural network (fully connected layers, such as the Deep Render Layers in fig. 8). As shown in fig. 9, the proposal results may include, but are not limited to: the three-dimensional detection frame ABCD of the object, the three-dimensional size of the object, the intersection point O' of the three-dimensional center point with the ground, and the projection point O of the three-dimensional center point on the two-dimensional image, where E represents a point on the ground in the same row as O'. The three-dimensional point of the object can be calculated using the image coordinates of point E.
In addition, the proposal results may also include the initial model category. For example, the categories of vehicles may include cars, SUVs, vans, buses, etc., each category corresponding to a vehicle model.
Referring to fig. 8, the object model corresponding to the model category can be obtained from the object models (Object Models) of various categories prepared in advance. Then, Object Refinement indicates that the three-dimensional object model is refined next, which may be referred to as the correction process.
S3: by using the camera height h and the normal vector n of the ground, a preliminary object distance Z can be estimated. Using the estimated Z and the estimated intersection O' of the three-dimensional center point of the object with the ground, initial three-dimensional position information (X, Y, Z) of the center point of an object can be calculated, see fig. 6 above.
An example of estimating the distance Z of the object by using the camera height h and the ground normal vector n, and obtaining the preliminary initial three-dimensional position information of the center point of the object, is as follows:
if the vehicle runs on a relatively flat road surface, the height of a camera for shooting the image is h, and a normal vector under a camera coordinate system is n ═ n (n)x,ny,nz) Then any point on the ground plane X ═ (X, Y, Z)TSatisfies the following formula (3-1):
(nx,ny,nz)*(X,Y,Z)T=h(3-1),
in the camera coordinate system, the three-dimensional point coordinates (X, Y, Z) and the corresponding image point coordinates (x, y) satisfy the following relationship:
X = (x - u0) / fx * Z    (3-2),
Y = (y - v0) / fy * Z    (3-3),
where (u0, v0) is the optical center position of the camera, and fx, fy are the focal lengths of the camera on the X axis and Y axis, respectively. In most cases, fx and fy are represented by a uniform focal length f.
By combining formulas (3-1)-(3-3), the three-dimensional coordinates of any ground point can be solved.
Based on the intersection point O' of the vehicle with the ground estimated in step S2, the initial three-dimensional coordinates of the vehicle are calculated by using the image coordinate position of O' in combination with equations (3-1)-(3-3). These coordinates are then used as the initial position information of the vehicle, serving as the initial value for further optimization.
S4: Using the initial three-dimensional point position and the estimated category of the object (e.g., the vehicle model), and using rendering techniques, the three-dimensional model of the vehicle may be projected onto the image to obtain a projected UV map; see the predicted UV in fig. 8, which may be denoted by U'V'-Map. See also figs. 10a, 10b and 10c: fig. 10b is an example of the U'-Map rendered from the original image of fig. 10a, and fig. 10c is an example of the rendered V'-Map. On this basis, the difference between the projected U'V'-Map and the segmented UV-Map (e.g., U' - U and V' - V) is established; see Rendering and Compare Loss in fig. 8. This difference is used as an energy loss function to correct the three-dimensional position information of the object, such as correcting the position estimate of the vehicle and the orientation angle of the vehicle.
FIG. 11 is a block diagram of an object detection device according to an embodiment of the present application. The apparatus may include:
the detection module 210 is configured to detect an image to obtain initial three-dimensional position information of an object in the image, a first map, and a model category;
the projection module 220 is configured to perform projection by using the initial three-dimensional position information of the object and the model type to obtain a second map;
and a correcting module 230, configured to correct the initial three-dimensional position information of the object by using the first map and the second map.
As shown in fig. 12, in one possible implementation, the detection module 210 includes:
the detection submodule 211 is configured to perform object detection on the first image to obtain a two-dimensional detection frame of the object;
an acquisition sub-module 212 for acquiring a second image including the object from the first image using the two-dimensional detection frame of the object;
the prediction submodule 213 is configured to predict the first image and the second image by using a neural network, so as to obtain a prediction result, where the prediction result includes an intersection point of a center point of the object and the ground;
and the calculating submodule 214 is used for calculating initial three-dimensional position information of the central point of the object by utilizing the intersection point of the central point of the object and the ground.
In a possible embodiment, the prediction submodule 213 is specifically configured to:
acquiring the characteristics of the first image and the characteristics of the second image;
inputting the characteristics of the first image and the second image into a characteristic layer of a neural network for merging;
and inputting the combined features into a plurality of full-connection layers of the neural network for prediction to obtain a prediction result.
In a possible implementation, the computation submodule 214 is specifically configured to:
obtaining the distance between the camera and the object by using the height of the camera and the normal vector of the ground;
and calculating initial three-dimensional position information of the central point of the object by using the distance and the intersection point of the central point of the object and the ground.
In one possible implementation, the modification module 230 includes:
a loss function sub-module 231 for establishing a loss function using the first map and the second map;
and the modification submodule 232 is configured to modify the initial three-dimensional position information of the object by using a loss function.
The functions of each module in each apparatus in the embodiment of the present application may refer to corresponding descriptions in the above method, and are not described herein again.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
As shown in fig. 13, it is a block diagram of an electronic device according to an object detection method of the embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 13, the electronic apparatus includes: one or more processors 901, memory 902, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). The present embodiment takes a processor 901 as an example.
Memory 902 is a non-transitory computer readable storage medium as provided herein. The memory stores instructions executable by the at least one processor to cause the at least one processor to perform the object detection method provided by the present application. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to perform the object detection method provided herein.
The memory 902, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules (e.g., the detection module 210, the projection module 220, and the modification module 230 shown in fig. 11) corresponding to the object detection method in the embodiments of the present application. The processor 901 executes various functional applications of the server and data processing, i.e., implements the object detection method in the above-described method embodiments, by running non-transitory software programs, instructions, and modules stored in the memory 902.
The memory 902 may include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the electronic device of the object detection method, and the like. Further, the memory 902 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 902 may optionally include a memory remotely located from the processor 901, which may be connected to the electronics of the object detection method via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the object detection method may further include: an input device 903 and an output device 904. The processor 901, the memory 902, the input device 903, and the output device 904 may be connected by a bus or other means, and fig. 13 illustrates an example of a connection by a bus.
The input device 903 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic apparatus of the object detection method, such as an input device of a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, or the like. The output devices 904 may include a display device, auxiliary lighting devices (e.g., LEDs), tactile feedback devices (e.g., vibrating motors), and the like. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application specific ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
In the embodiment of the application, the initial three-dimensional position information can be corrected by using the prior information of the object, such as the model category of the vehicle, so that a more accurate three-dimensional position of the object can be obtained.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and the present invention is not limited thereto as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (12)

1. An object detection method comprising:
detecting an image to obtain initial three-dimensional position information, a first map and a model category of an object in the image;
projecting by using the initial three-dimensional position information and the model category of the object to obtain a second map;
and correcting the initial three-dimensional position information of the object by using the first map and the second map.
2. The method of claim 1, wherein detecting an image to obtain initial three-dimensional position information of an object in the image comprises:
carrying out object detection on the first image to obtain a two-dimensional detection frame of the object;
acquiring a second image including the object from the first image by using the two-dimensional detection frame of the object;
predicting the first image and the second image by utilizing a neural network to obtain a prediction result, wherein the prediction result comprises a junction point of a central point of the object and the ground;
and calculating initial three-dimensional position information of the central point of the object by using the intersection point of the central point of the object and the ground.
3. The method of claim 2, wherein predicting the first image and the second image using a neural network to obtain a prediction result comprises:
acquiring the characteristics of the first image and the characteristics of the second image;
inputting the features of the first image and the second image into a feature layer of the neural network for merging;
and inputting the combined features into a plurality of full-connection layers of the neural network for prediction to obtain the prediction result.
4. The method of claim 2 or 3, wherein calculating initial three-dimensional position information of the center point of the object using an intersection point of the center point of the object and the ground comprises:
obtaining the distance between the camera and the object by using the height of the camera and the normal vector of the ground;
and calculating initial three-dimensional position information of the central point of the object by using the distance and the intersection point of the central point of the object and the ground.
5. The method of claim 4, wherein correcting the initial three-dimensional position information of the object using the first map and the second map comprises:
establishing a loss function by using the first map and the second map;
and correcting the initial three-dimensional position information of the object by using the loss function.
6. An object detecting device comprising:
the detection module is used for detecting the image to obtain initial three-dimensional position information, a first map and a model category of an object in the image;
the projection module is used for projecting by utilizing the initial three-dimensional position information and the model category of the object to obtain a second map;
and the correction module is used for correcting the initial three-dimensional position information of the object by utilizing the first map and the second map.
7. The apparatus of claim 6, wherein the detection module comprises:
the detection submodule is used for carrying out object detection on the first image to obtain a two-dimensional detection frame of the object;
an acquisition sub-module configured to acquire a second image including the object from the first image using the two-dimensional detection frame of the object;
the prediction submodule is used for predicting the first image and the second image by utilizing a neural network to obtain a prediction result, and the prediction result comprises an intersection point of a central point of the object and the ground;
and the calculation submodule is used for calculating the initial three-dimensional position information of the central point of the object by utilizing the intersection point of the central point of the object and the ground.
8. The apparatus of claim 7, wherein the predictor module is specifically configured to:
acquiring the characteristics of the first image and the characteristics of the second image;
inputting the features of the first image and the second image into a feature layer of the neural network for merging;
and inputting the combined features into a plurality of full-connection layers of the neural network for prediction to obtain the prediction result.
9. The apparatus according to claim 7 or 8, wherein the computation submodule is specifically configured to:
obtaining the distance between the camera and the object by using the height of the camera and the normal vector of the ground;
and calculating initial three-dimensional position information of the central point of the object by using the distance and the intersection point of the central point of the object and the ground.
10. The apparatus of claim 9, wherein the modification module comprises:
a loss function sub-module, configured to establish a loss function using the first map and the second map;
and the correction submodule is used for correcting the initial three-dimensional position information of the object by utilizing the loss function.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5.
12. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-5.
CN202010593140.XA 2020-06-26 2020-06-26 Object detection method, device, equipment and storage medium Active CN111753739B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010593140.XA CN111753739B (en) 2020-06-26 2020-06-26 Object detection method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010593140.XA CN111753739B (en) 2020-06-26 2020-06-26 Object detection method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111753739A true CN111753739A (en) 2020-10-09
CN111753739B CN111753739B (en) 2023-10-31

Family

ID=72677363

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010593140.XA Active CN111753739B (en) 2020-06-26 2020-06-26 Object detection method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111753739B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112380991A (en) * 2020-11-13 2021-02-19 贝壳技术有限公司 Article model placing method and device, storage medium and electronic equipment
CN112819880A (en) * 2021-01-07 2021-05-18 北京百度网讯科技有限公司 Three-dimensional object detection method, device, equipment and storage medium
WO2022247414A1 (en) * 2021-05-26 2022-12-01 北京地平线信息技术有限公司 Method and apparatus for generating space geometry information estimation model

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190012807A1 (en) * 2017-07-04 2019-01-10 Baidu Online Network Technology (Beijing) Co., Ltd.. Three-dimensional posture estimating method and apparatus, device and computer storage medium
CN109767487A (en) * 2019-01-04 2019-05-17 北京达佳互联信息技术有限公司 Face three-dimensional rebuilding method, device, electronic equipment and storage medium
CN110148217A (en) * 2019-05-24 2019-08-20 北京华捷艾米科技有限公司 A kind of real-time three-dimensional method for reconstructing, device and equipment
CN110895823A (en) * 2020-01-10 2020-03-20 腾讯科技(深圳)有限公司 Texture obtaining method, device, equipment and medium for three-dimensional model

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190012807A1 (en) * 2017-07-04 2019-01-10 Baidu Online Network Technology (Beijing) Co., Ltd.. Three-dimensional posture estimating method and apparatus, device and computer storage medium
CN109767487A (en) * 2019-01-04 2019-05-17 北京达佳互联信息技术有限公司 Face three-dimensional rebuilding method, device, electronic equipment and storage medium
CN110148217A (en) * 2019-05-24 2019-08-20 北京华捷艾米科技有限公司 A kind of real-time three-dimensional method for reconstructing, device and equipment
CN110895823A (en) * 2020-01-10 2020-03-20 腾讯科技(深圳)有限公司 Texture obtaining method, device, equipment and medium for three-dimensional model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG Wei; YU Hongjun; AN Hongxi; GU Tianhui: "Research on Layout Simulation Technology of Aircraft Digital Assembly Production Line", Manufacturing Automation, no. 10, pages 64 - 66 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112380991A (en) * 2020-11-13 2021-02-19 贝壳技术有限公司 Article model placing method and device, storage medium and electronic equipment
CN112819880A (en) * 2021-01-07 2021-05-18 北京百度网讯科技有限公司 Three-dimensional object detection method, device, equipment and storage medium
WO2022247414A1 (en) * 2021-05-26 2022-12-01 北京地平线信息技术有限公司 Method and apparatus for generating space geometry information estimation model

Also Published As

Publication number Publication date
CN111753739B (en) 2023-10-31

Similar Documents

Publication Publication Date Title
JP7106665B2 (en) MONOCULAR DEPTH ESTIMATION METHOD AND DEVICE, DEVICE AND STORAGE MEDIUM THEREOF
EP3709271B1 (en) Image depth prediction neural networks
CN112652016B (en) Point cloud prediction model generation method, pose estimation method and pose estimation device
CN111739005B (en) Image detection method, device, electronic equipment and storage medium
CN111753961B (en) Model training method and device, prediction method and device
CN112258618A (en) Semantic mapping and positioning method based on fusion of prior laser point cloud and depth map
CN111753739B (en) Object detection method, device, equipment and storage medium
CN111722245B (en) Positioning method, positioning device and electronic equipment
CN111291885A (en) Near-infrared image generation method, network generation training method and device
US20220270289A1 (en) Method and apparatus for detecting vehicle pose
KR102472767B1 (en) Method and apparatus of calculating depth map based on reliability
US11915439B2 (en) Method and apparatus of training depth estimation network, and method and apparatus of estimating depth of image
US9437034B1 (en) Multiview texturing for three-dimensional models
CN112634343A (en) Training method of image depth estimation model and processing method of image depth information
WO2023016271A1 (en) Attitude determining method, electronic device, and readable storage medium
CN111291650A (en) Automatic parking assistance method and device
CN112097732A (en) Binocular camera-based three-dimensional distance measurement method, system, equipment and readable storage medium
CN111652113A (en) Obstacle detection method, apparatus, device, and storage medium
CN111524192A (en) Calibration method, device and system for external parameters of vehicle-mounted camera and storage medium
JP2022050311A (en) Method for detecting lane change of vehicle, system, electronic apparatus, storage medium, roadside machine, cloud control platform, and computer program
CN115719436A (en) Model training method, target detection method, device, equipment and storage medium
CN113724388B (en) High-precision map generation method, device, equipment and storage medium
CN112528932B (en) Method and device for optimizing position information, road side equipment and cloud control platform
CN112085842B (en) Depth value determining method and device, electronic equipment and storage medium
CN115147809B (en) Obstacle detection method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant