CN111144213A - Object detection method and related equipment
- Publication number: CN111144213A (application CN201911175243.8A)
- Authority: CN (China)
- Prior art keywords: pixel points, depth image, target, point cloud, foreground
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/60—Type of objects
- G06V20/64—Three-dimensional objects
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Artificial Intelligence (AREA)
- Probability & Statistics with Applications (AREA)
- Human Computer Interaction (AREA)
- Image Analysis (AREA)
Abstract
The embodiment of the application discloses an object detection method and related equipment. After a depth image to be detected is obtained, a foreground region can be segmented from the depth image according to the depth values of the pixel points in the depth image. Since the target object is more likely to be located in the foreground region of the depth image, a corresponding point cloud may be generated only for the foreground region, and whether the target object is included in the depth image is then detected according to the generated point cloud. In this method, a point cloud is generated and further detected only for the foreground region, which is more likely to include the target object, while background regions that are unlikely to include the target object are discarded, i.e. no point cloud is generated for them, so that the amount of computation is reduced and the efficiency and real-time performance of object detection are improved.
Description
Technical Field
The present application relates to the field of computer vision, and in particular, to an object detection method and related apparatus.
Background
Object detection is widely used in modern life: detecting objects such as human bodies in places such as shopping malls, stations, and homes can provide data for many applications such as security, entertainment, and precise services. In scenes with very poor lighting, object detection algorithms based on color images fail, whereas object detection algorithms based on depth images perform well and thus compensate for the shortcomings of color images.
At present, methods for object detection based on a depth map mainly involve point cloud detection algorithms: the depth image is preprocessed and converted into a point cloud, the ground point cloud is determined according to prior information so that the non-ground point cloud can be extracted, and the extracted point cloud is further detected. A point cloud is a vector composed of a plurality of spatial points, each of which contains corresponding spatial coordinate information, color information, and so on.
Because this approach needs to generate and process point clouds even for parts of the depth image that do not contain the target object, such as the ground, its object detection efficiency is low and its real-time performance is poor.
Disclosure of Invention
In order to solve the above technical problem, the present application provides an object detection method and related equipment, which reduce the amount of computation and improve object detection efficiency and real-time performance.
The embodiment of the application discloses the following technical scheme:
in a first aspect, an embodiment of the present application provides an object detection method, where the method includes:
acquiring a depth image to be detected;
segmenting a foreground region from the depth image according to the depth value of a pixel point in the depth image;
generating a point cloud corresponding to the foreground area;
and detecting whether a target object is included in the depth image or not according to the point cloud.
In a second aspect, an embodiment of the present application provides an object detection apparatus, including:
an acquisition unit, configured to acquire a depth image to be detected;
the segmentation unit is used for segmenting a foreground region from the depth image according to the depth value of a pixel point in the depth image;
the generating unit is used for generating a point cloud corresponding to the foreground area;
and the detection unit is used for detecting whether the target object is included in the depth image or not according to the point cloud.
In a third aspect, an embodiment of the present application provides an apparatus for object detection, the apparatus including a processor and a memory, where the memory is configured to store program code and transmit the program code to the processor;
the processor is configured to perform the object detection method according to the first aspect according to instructions in the program code.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium for storing program codes, where the program codes are used to execute the object detection method according to the first aspect.
According to the technical solution above, after the depth image to be detected is obtained, a foreground region can be segmented from the depth image according to the depth values of the pixel points in the depth image. Since the target object is more likely to be located in the foreground region of the depth image, a corresponding point cloud may be generated only for the foreground region, and whether the target object is included in the depth image is then detected according to the generated point cloud. In this method, a point cloud is generated and further detected only for the foreground region, which is more likely to include the target object, while background regions that are unlikely to include the target object are discarded, i.e. no point cloud is generated for them, so that the amount of computation is reduced and the efficiency and real-time performance of object detection are improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is apparent that the drawings in the following description are only some embodiments of the present application, and that those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of an object detection method according to an embodiment of the present disclosure;
fig. 2 is a flowchart of a method for segmenting a foreground region from a depth image according to an embodiment of the present disclosure;
fig. 3 is a flowchart of a method for detecting whether a target object is included in a depth image according to a point cloud according to an embodiment of the present application;
FIG. 4 is a flowchart of a method for determining whether a sub-point cloud corresponds to a target object according to an embodiment of the present disclosure;
FIG. 5 is a flowchart of a method for determining whether a forward projection corresponds to a target object according to an embodiment of the present disclosure;
fig. 6 is a flowchart of a human body detection method according to an embodiment of the present application;
fig. 7 is a structural diagram of an object detection apparatus according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the accompanying drawings.
At present, methods for detecting an object based on a depth map mainly involve point cloud detection algorithms. Since such a method needs to generate and process point clouds even for parts of the depth image that do not contain the target object, such as the ground, its object detection efficiency is low and its real-time performance is poor.
Therefore, an embodiment of the present application provides an object detection method in which a point cloud is generated and further detected only for the foreground region of the depth image, which is more likely to include the target object, so that the amount of computation is reduced and object detection efficiency and real-time performance are improved.
First, an execution body of the embodiment of the present application will be described. The object detection method provided by the application can be applied to image processing equipment, such as terminal equipment and a server. The terminal device may be a user terminal, and the terminal device may be, for example, an intelligent terminal, a computer, a Personal Digital Assistant (PDA), a tablet computer, and the like.
The object detection method can also be applied to a server, and the server can acquire the depth image to be detected sent by the terminal equipment, detect the object of the depth image and send the detection result to the terminal equipment. The server may be a stand-alone server or a server in a cluster.
In order to facilitate understanding of the technical solution of the present application, a server is taken as an execution subject, and an object detection method provided by the embodiment of the present application is introduced in combination with an actual application scenario.
In this embodiment, the server may obtain a depth image to be detected, where each pixel point in the depth image corresponds to a depth value. The depth value of a pixel point can be used to represent how far the real scene corresponding to that pixel point is from the camera lens: the larger the depth value, the farther the corresponding real scene is from the camera; the smaller the depth value, the closer the corresponding real scene is to the camera.
A foreground region and a background region may be segmented from the depth image, where the depth values of pixels in the foreground region are lower than the depth values of pixels in the background region. For example, if a depth image captured by the camera includes a human body closer to the camera (with lower depth values) and a wall farther from the camera (with higher depth values), the image area corresponding to the human body may be used as the foreground region and the image area corresponding to the wall as the background region.
In an actual scene, when detecting an object based on an image, the object to be detected is likely to be closer to a camera lens, so that a clear image of the object can be obtained, and the accuracy of object detection is improved. As such, the target object that needs to be detected is more likely to be distributed in the foreground region in the depth image than in the background region in the depth image. For example, based on the above example, it is assumed that the target object to be detected is a human body, and an image area corresponding to the human body in the depth image is a foreground area.
Based on the method, the server can segment the foreground area from the depth image according to the depth value of the pixel point in the depth image, and generate corresponding point cloud for the foreground area. Thus, whether the target object is included in the depth image is detected according to the generated point cloud.
By executing this object detection method, point clouds are generated and further detected only for the foreground regions of the depth image that are more likely to include the target object, while background regions that are unlikely to include the target object are discarded, i.e. no point clouds are generated for them, so that the amount of computation is reduced and object detection efficiency and real-time performance are improved.
Next, the object detection method provided by the embodiment of the present application will be described with a server as an execution subject.
Referring to fig. 1, the figure shows a flowchart of an object detection method provided in an embodiment of the present application, where the method may include:
s101: and acquiring a depth image to be detected.
It should be noted that, in the embodiment of the present application, a manner of obtaining the depth image to be detected by the server is not limited, and a suitable manner may be selected to obtain the depth image to be detected according to an actual situation. For example: the depth image may be acquired by the server after being photographed by a camera with a depth image photographing function, or the depth image may be acquired by the server after being processed into a depth image according to a color image.
S102: and segmenting a foreground region from the depth image according to the depth value of the pixel point in the depth image.
In this embodiment, the server may segment the foreground region from the depth image according to the depth value of the pixel point in the depth image.
For example: a depth threshold value can be preset, and an area formed by pixel points of which the depth values are smaller than the depth threshold value in the depth image is used as a foreground area and is segmented.
In addition, in a possible implementation manner, each pixel point in the depth image may correspond to a background probability condition, and the background probability conditions corresponding to different pixel points in the depth image may be different. When the depth value of a pixel point satisfies the background probability condition corresponding to that pixel point, the pixel point can be determined to be a background pixel point. In this case, for a target pixel point in the depth image, referring to fig. 2, which shows a flowchart of a method for segmenting a foreground region from a depth image according to an embodiment of the present application, segmenting a foreground region from the depth image according to the depth values of the pixel points in the depth image in S102 may include:
s201: and determining whether the depth value of the target pixel point meets the corresponding background probability condition, and if not, executing S202.
The target pixel point can be any pixel point in the depth image.
S202: and determining the target pixel point as a foreground pixel point belonging to the foreground area.
That is to say, if the depth value of the target pixel does not satisfy the corresponding background probability condition, the target pixel may be considered as a foreground pixel, and an area formed by the foreground pixels is segmented from the depth image and used as a foreground area.
The following describes a specific implementation of S201-S202. Assuming that the length of the depth image is W and its width is H, the set of pixels in the depth image is S = { (i, j) | 0 ≤ i < W, 0 ≤ j < H }, where (i, j) identifies a position in the depth image and each pixel point (i, j) belongs to S. Each pixel point has a background probability distribution, which is assumed to be a normal distribution with mean μ_{i,j} and standard deviation σ_{i,j}.
For the depth value I_{i,j} of a target pixel point in the depth image, the background probability condition corresponding to that pixel point may be, for example, |I_{i,j} − μ_{i,j}| ≤ α·σ_{i,j}. If the depth value I_{i,j} satisfies the background probability condition, the pixel point can be determined to be a background pixel point belonging to the background region; if it does not, the pixel point can be determined to be a foreground pixel point belonging to the foreground region. Here α can be a fixed parameter whose value can range from 1.0 to 3.0.
In a specific implementation, a probability model may be generated based on the background probability conditions, so that a depth image is input into the model and a binary (0-1) image corresponding to the depth image is output, where a value of 0 at a pixel point indicates that it is a background pixel point belonging to the background region, and a value of 1 indicates that it is a foreground pixel point belonging to the foreground region.
After the output image is obtained, the pixel points whose value is 1 can be clustered according to their depth values and proximity relations to obtain foreground regions, and the foreground regions are segmented from the depth image. The set of segmented foreground regions may be S_F = {s_1, s_2, …, s_n}, where each element s_k (k = 1, 2, …, n) is a foreground region and each foreground region s_k is a set of coordinates of that region in the coordinate system of the output image. The coordinate system of the output image may take the position of the lower left corner of the output image as its origin.
By the method, the foreground region can be accurately segmented from the depth image.
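As an illustration of this segmentation step, the following sketch applies the per-pixel normal background model and then groups foreground pixels into regions. It is only a minimal sketch under assumptions: the background condition is taken as |I − μ| ≤ α·σ, the clustering step is approximated by simple connected-component labelling (the depth-based part of the clustering is omitted), and the array names (depth, mu, sigma) are hypothetical.

```python
import numpy as np
from scipy import ndimage

def segment_foreground(depth, mu, sigma, alpha=2.0):
    """Per-pixel background test: a pixel whose depth value violates the
    assumed background condition |I - mu| <= alpha * sigma is marked foreground (1)."""
    foreground = (np.abs(depth - mu) > alpha * sigma).astype(np.uint8)
    foreground[depth == 0] = 0                 # zero-depth pixels are treated as noise, not foreground
    # group foreground pixels by proximity into candidate foreground regions s_1..s_n
    labels, n = ndimage.label(foreground)
    regions = [np.argwhere(labels == k + 1) for k in range(n)]
    return foreground, regions
```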
S103: and generating a point cloud corresponding to the foreground area.
The point cloud is a vector composed of a plurality of space points, and each space point comprises corresponding space coordinate information, color information and the like.
In an actual scene, when point clouds are generated for all scenes in a depth image, the point clouds need to be generated according to the shape (contour) of each scene, and if the quality of the depth image is poor, the quality of the point clouds generated by the method is low.
In the embodiment of the present application, segmentation is performed in advance (the foreground region is segmented first). Since the segmented foreground region already carries the contour features of the segmentation, even if the quality of the depth image is poor, the point cloud generated in S103 according to the contour of the segmented foreground region can still be of high quality. Compared with segmenting directly on the point cloud as in the related art, this approach is also fast and efficient.
In a particular implementation, based on the camera's optical center (c_x, c_y) and focal length (f_x, f_y), the coordinates (i, j) of the pixel points in each foreground region can be converted into spatial coordinates in the camera coordinate system, P_{i,j} = (x_{i,j}, y_{i,j}, z_{i,j})^T, with z_{i,j} = I_{i,j}.
Thus, based on the foreground region set S_F = {s_1, s_2, …, s_n}, a set of point clouds V = {v_1, v_2, …, v_n} can be generated, where each point cloud v_k corresponds to a foreground region s_k and the point cloud elements of v_k are the spatial coordinates P_{i,j} of the pixel points of that region.
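A sketch of the point cloud generation for one foreground region follows. It assumes the standard pinhole back-projection for the x and y components (implied by the optical center and focal length but not written out above), and the function and variable names are illustrative.

```python
import numpy as np

def region_to_point_cloud(depth, region_pixels, fx, fy, cx, cy):
    """Back-project the pixels of one foreground region into camera coordinates,
    assuming the usual pinhole model; z is the depth value of the pixel."""
    pts = []
    for (j, i) in region_pixels:        # argwhere-style (row, col) = (j, i)
        z = float(depth[j, i])
        if z <= 0:
            continue                    # skip zero-depth (invalid) pixels
        x = (i - cx) * z / fx
        y = (j - cy) * z / fy
        pts.append((x, y, z, i, j))     # keep pixel indices alongside the coordinates
    return np.array(pts)
```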
S104: and detecting whether the target object is included in the depth image or not according to the point cloud.
Therefore, the server can detect whether the target object is included in the depth image or not according to the point cloud of the foreground area.
According to the technical solution above, after the depth image to be detected is obtained, a foreground region can be segmented from the depth image according to the depth values of the pixel points in the depth image. Since the target object is more likely to be located in the foreground region of the depth image, a corresponding point cloud may be generated only for the foreground region, and whether the target object is included in the depth image is then detected according to the generated point cloud. In this method, a point cloud is generated and further detected only for the foreground region, which is more likely to include the target object, while background regions that are unlikely to include the target object are discarded, i.e. no point cloud is generated for them, so that the amount of computation is reduced and the efficiency and real-time performance of object detection are improved.
In one possible implementation, in order to detect a target object moving in the depth image, the object detection method may further include:
s301: after the detection of the depth images of the continuous T1 frames is completed, for a first set including pixel points at a target position in the depth images of the continuous T1 frames, if the number of foreground pixel points in the first set is not less than a first number threshold, the server may update a background probability condition corresponding to the pixel points at the target position to a foreground probability condition.
The target position may be any position in the depth image, for example, the target position may be a position of a pixel point at an upper left corner of the depth image, and the target position may include one pixel point or a plurality of pixel points.
The first number threshold may be used to measure whether the number of foreground pixel points in the first set is large enough; if the number of foreground pixel points in the first set is not less than the first number threshold, it may be determined that the first set contains many foreground pixel points.
In an actual scene, when it is determined that many of the pixel points at the target position in the T1 frames of depth images are foreground pixel points, this indicates that a target object such as a human body may appear in the real scene corresponding to the target position of the depth image. At the same time, it may also indicate that the target object in the real scene corresponding to the target position has remained stationary at that position.
In this way, in order to detect a moving target object, the server may update the background probability condition corresponding to the pixel point at the target position to the foreground probability condition. The foreground probability condition may be determined according to depth values of pixels at target positions in the consecutive T1 frame depth images, and the foreground probability condition may be used to determine whether a pixel at a target position in a depth image after the consecutive T1 frame depth images is a foreground pixel. The depth image after the consecutive T1 frame depth images may refer to a depth image acquired after the consecutive T1 frame depth images according to the timing of the video.
Compared with the background probability condition, the foreground probability condition of the pixel point at the target position can determine the pixel point with a smaller depth value as the foreground pixel point. In addition, since the foreground probability condition may be determined according to the depth values of the pixels at the target positions in the consecutive T1 frame depth images, if the depth values of the pixels at the target positions in the depth images after the consecutive T1 frame depth images are not changed, that is, the target object in the real scene corresponding to the target positions is not moved, the depth values of the pixels at the target positions (corresponding to the target object) cannot be determined as foreground pixels based on the foreground probability condition.
Next, the method of S301 is described based on the specific implementation of S201-S202. A background counter C_B can be preset for each pixel position. After each consecutive frame of depth image is detected in sequence, if the pixel point at the target position is determined to be a background pixel point, C_B can be incremented by 1. After completing the detection for the T1 consecutive frames of depth images, if C_B ≤ T1 − the first number threshold, the background probability condition of the pixel point at that target position may be updated to a foreground probability condition, i.e. the foreground probability distribution obeys a normal distribution whose mean and standard deviation are determined from the depth values of the pixel point at the target position in the T1 consecutive frames of depth images.
It should be noted that the embodiments of the present application do not limit the distribution manner of the background probability condition and the foreground probability condition, and for example, a gaussian distribution, a uniform distribution, a non-parametric distribution, and the like may also be used.
In addition, after determining whether a pixel is a foreground pixel based on the background probability condition or the foreground probability condition of each pixel in the depth image, the foreground probability condition or the background probability condition of the pixel can be updated. The updating method can comprise the following steps:
if the pixel point is determined to be a background pixel point, the background probability condition is updated based on the depth value I_{i,j} of the pixel point, for example as a running average of the form μ_{i,j} ← (1 − t)·μ_{i,j} + t·I_{i,j} and σ_{i,j} ← (1 − t)·σ_{i,j} + t·|I_{i,j} − μ_{i,j}|, where t is a weight parameter; in other cases the distribution parameters may be kept, with the standard deviation bounded above by σ_m, the upper standard deviation limit. The weighting parameter t may vary over time or with other conditions.
By carrying out self-adaptive updating on the foreground probability condition and the background probability condition for the pixel points in the depth image, the moving target object in the depth image can be detected, and the speed of the whole calculation process is improved on the premise of ensuring the stable result.
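A minimal sketch of the adaptive update is shown below. The exact update formulas are not given above, so an exponential moving average with weight t and an upper bound sigma_max on the standard deviation are assumed; the names are illustrative.

```python
import numpy as np

def update_background_model(depth, mu, sigma, is_background, t=0.05, sigma_max=200.0):
    """Running update of the per-pixel background distribution.
    Only pixels currently classified as background are updated; an exponential
    moving average with weight t is assumed, and sigma is capped at sigma_max."""
    d = depth[is_background]
    mu[is_background] = (1 - t) * mu[is_background] + t * d
    dev = np.abs(d - mu[is_background])                       # deviation against the updated mean
    sigma[is_background] = np.minimum((1 - t) * sigma[is_background] + t * dev, sigma_max)
    return mu, sigma
```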
In a possible implementation manner, if the pixel point on the target location corresponds to the foreground probability condition, the method may further include:
s401: after the detection of the continuous T2 frame depth images in the video is completed, aiming at a second set comprising pixel points at the target position in the continuous T2 frame depth images, if the number of the pixel points belonging to the background in the second set is not less than a second number threshold, updating the foreground probability condition corresponding to the pixel points at the target position to be a background probability condition.
The second number threshold may be used to measure whether the number of background pixel points in the second set is large enough; if the number of background pixel points in the second set is not less than the second number threshold, it may be determined that the second set contains many background pixel points.
In an actual scene, when it is determined that many of the pixel points at the target position in the T2 frames of depth images are background pixel points, this indicates that a target object that previously existed in the real scene corresponding to the target position of the depth image has left.
In this way, in order to ensure that the target object is detected, the server may update the foreground probability condition corresponding to the pixel point at the target position to the background probability condition. The background probability condition may be determined according to depth values of pixels at target positions in the consecutive T2 frame depth images, and the background probability condition may be used to determine whether a pixel at a target position in a depth image after the consecutive T2 frame depth images is a foreground pixel.
By the method, the accuracy of target object detection can be improved.
In actual scenes, depth images of poor quality are easily obtained due to factors such as the acquisition device and the environment. For example, pixels in the depth image corresponding to distant scenes or corners may have a depth value of 0; such zero-depth-value pixels are usually noise points. To this end, in one possible implementation, the method may further include:
s501: after the detection of the continuous T3 frame depth images in the video is completed, for a third set including pixel points at the target position in the continuous T3 frame depth images, if the number of pixel points belonging to zero depth values in the third set is not less than a third number threshold, the target pixel points do not include the pixel points at the target position in the depth images.
The target position may be any position in the depth image, for example, the target position may be a position of a pixel point at an upper left corner of the depth image, and the target position may include one pixel point or a plurality of pixel points. The target position in the present embodiment may be the same as or different from the target position in the foregoing embodiment.
The third number threshold may be used to measure whether the number of the pixels belonging to the zero depth value in the third set is proper, and if the number of the pixels belonging to the zero depth value in the third set is not less than the third number threshold, it may be determined that the number of the pixels belonging to the zero depth value in the third set is greater.
When the server determines that there are too many zero-depth-value pixel points in the third set, this indicates that the pixel point at the target position has been a zero-depth-value pixel point, either continuously or intermittently, across the T3 frames of depth images, so the target pixel points in S201 may not include the pixel points at that target position.
The method of S501 is described below based on the specific implementation of S201-S202. A zero-value counter C_0 and a distribution failure flag F can be preset for each pixel position, with F initialized to 0. After each consecutive frame of depth image is detected in sequence, if the pixel point at the target position is determined to be a zero-depth-value pixel point, C_0 can be incremented by 1. After detecting the T3 consecutive frames of depth images, if C_0 ≥ C_Z, then F is set to 1, where C_Z may be the third number threshold. When the distribution failure flag F is 1, the pixel point at the target position is considered invalid, that is, the target pixel points in S201 may not include the pixel point at the target position. If C_0 < C_Z, then F is set to 0, indicating that the pixel point at the target position has not failed, and the target pixel points in S201 may include the pixel point at the target position.
By executing the method, the invalid zero-depth-value pixel point can be prevented from being taken as a target pixel point and being segmented into the foreground area, and the influence of the noise point on object detection is reduced.
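A compact sketch of the zero-depth failure check over a window of T3 frames follows; the counter and flag are kept implicitly as arrays, and the names (depth_frames, c_z) are assumptions.

```python
import numpy as np

def zero_depth_failure_mask(depth_frames, c_z):
    """depth_frames: T3 consecutive depth images stacked as shape (T3, H, W).
    A pixel position is flagged as failed (excluded from the target pixel points)
    when it has a zero depth value in at least c_z of these frames."""
    zero_count = (np.asarray(depth_frames) == 0).sum(axis=0)   # zero-value counter per position
    return zero_count >= c_z                                   # distribution failure flag
```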
In an actual scene, a situation may occur in which two or more objects are in contact in a scene that is closer to the camera (divided into the foreground in S102), and thus, one foreground region divided in S102 may be a connected region corresponding to the two or more objects. In this case, in order to improve the accuracy of object detection, in a possible implementation manner, referring to fig. 3, this figure shows a flowchart of a method for detecting whether a target object is included in a depth image according to a point cloud provided in an embodiment of the present application, and as shown in fig. 3, the method for detecting whether a target object is included in a depth image according to a point cloud in S104 above may include:
s601: determining world coordinates of point cloud elements in the point cloud.
In the case where the generated point cloud is a point cloud under the camera coordinate system, the world coordinates of the point cloud elements in the point cloud may also be determined.
In a specific implementation, based on the above S103, a world coordinate system can be established by taking the projection of the camera onto the ground as the origin, the direction from the origin to the camera as the z-axis, and the projection of the camera's visual axis onto the ground as the y-axis. A transformation matrix M from the camera coordinate system to the world coordinate system is obtained through calibration, and the point cloud set V in the camera coordinate system is thereby transformed into a point cloud set W in the world coordinate system. A point cloud element coordinate of point cloud v_i in the camera coordinate system is transformed into the world coordinate system as (x_{w,i,j}, y_{w,i,j}, z_{w,i,j})^T = M·(x_{v,i,j}, y_{v,i,j}, z_{v,i,j})^T, where (x_{v,i,j}, y_{v,i,j}, z_{v,i,j})^T is a point cloud element coordinate of point cloud v_i in the camera coordinate system and (x_{w,i,j}, y_{w,i,j}, z_{w,i,j})^T is the corresponding coordinate in the world coordinate system.
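The camera-to-world transform can be sketched as follows, assuming the calibrated matrix M is a 3x4 rigid transform applied to homogeneous coordinates; the pixel indices (i, j) are simply carried along, as in the notation above.

```python
import numpy as np

def camera_to_world(points_cam, M):
    """Transform camera-frame point cloud elements (x_v, y_v, z_v, i, j) into the
    world frame using a calibrated transform M (assumed shape 3x4)."""
    xyz = points_cam[:, :3]
    xyz_h = np.hstack([xyz, np.ones((len(xyz), 1))])   # homogeneous coordinates
    xyz_w = xyz_h @ M.T                                # (x_w, y_w, z_w)
    return np.hstack([xyz_w, points_cam[:, 3:5]])      # keep the pixel indices (i, j)
```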
S602: and generating a height map and a density map according to world coordinates.
The pixel points in the height map and the density map correspond to ground positions for projecting the point cloud to a ground coordinate system, the pixel points in the height map identify the maximum height value of the ground positions corresponding to the pixel points in the height map, and the maximum height value is the maximum value in the heights corresponding to the point cloud elements projected to the ground positions; the pixel points in the density map identify the number of point cloud elements projected to the corresponding ground location.
In the embodiment of the application, the grid processing is performed according to the world coordinates of the point cloud elements in the point cloud, and a height map and a density map are obtained.
Based on the specific implementation in S601, the generation of the height map and the density map is introduced as follows. The point cloud elements of each point cloud in the point cloud set W are scaled down by a proportionality coefficient β and projected onto the ground plane coordinate system, yielding a height map H and a density map D, where each pixel point (k, l) in the height map identifies the maximum height value among the height values of the point cloud elements projected onto the ground position corresponding to that pixel point, and each pixel point (k, l) in the density map identifies the number of point cloud elements projected onto that ground position, with 0 ≤ k < w_M and 0 ≤ l < h_M, where w_M is the number of pixels of the height map and the density map in the transverse direction and h_M is the number of pixels in the longitudinal direction.
The maximum height value H_{k,l} of each pixel point in the height map and the number D_{k,l} of point cloud elements of each pixel point in the density map can be calculated as H_{k,l} = max{ z_w | the point cloud element (x_w, y_w, z_w) in W is projected to the ground position (k, l) } and D_{k,l} = |{ (x_w, y_w, z_w) in W | the element is projected to the ground position (k, l) }|.
The calculation amount can be reduced by reducing the point cloud element of each point cloud in the point cloud set W by the scale factor β.
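A sketch of this rasterization follows; the mapping from ground coordinates to grid cells via floor(β·x), floor(β·y) is an assumption consistent with the scale factor β described above, and the names are illustrative.

```python
import numpy as np

def height_and_density_maps(points_w, beta, w_m, h_m):
    """Rasterize world-frame point cloud elements onto the ground plane.
    points_w: array of (x_w, y_w, z_w) rows; beta is the assumed scale factor
    mapping ground coordinates to grid cells."""
    H = np.zeros((h_m, w_m), dtype=np.float32)   # maximum height per ground cell
    D = np.zeros((h_m, w_m), dtype=np.int32)     # number of elements per ground cell
    k = np.floor(beta * points_w[:, 0]).astype(int)   # transverse cell index
    l = np.floor(beta * points_w[:, 1]).astype(int)   # longitudinal cell index
    valid = (k >= 0) & (k < w_m) & (l >= 0) & (l < h_m)
    for kk, ll, z in zip(k[valid], l[valid], points_w[valid, 2]):
        H[ll, kk] = max(H[ll, kk], z)
        D[ll, kk] += 1
    return H, D
```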
S603: and determining the predicted ground position of the target object according to the height map and the density map.
In one possible implementation, the predicted ground location of the target object may be determined from the height map and the density map and by a first discriminant function. The first discriminant function may be determined according to a relationship between a target height and a target density, where the point cloud corresponding to the target object is projected onto the ground.
Based on the specific implementation in S602, a specific implementation of S603 is described below. A first discriminant function g(h, d) may be preset, where h and d are respectively the maximum height value and the number of point cloud elements of the pixel points corresponding to the same ground position in the height map and the density map, and u, v may be tuning parameters of the function.
Then, for the pixel points (k, l) corresponding to the same ground position in the height map and the density map, it can be judged whether g(H_{k,l}, D_{k,l}) = max{ g(H_{k+a,l+b}, D_{k+a,l+b}) | (a, b) ∈ B(r) } is satisfied, where B(r) may be a pixel neighborhood of radius r in the height map or the density map. If it is satisfied, the pixel point (k, l) can be added to the position set Q.
Finally, non-maximum suppression can be performed on the set Q based on the radius of the target object, so that the ground positions corresponding to the pixel points in the set are not too close to each other, yielding a set Q'.
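The local-maximum search and non-maximum suppression over the height and density maps can be sketched as below. The first discriminant function g is not spelled out above, so a simple weighted sum u·H + v·D is assumed purely for illustration, as is the minimum separation used for the suppression.

```python
import numpy as np

def candidate_ground_positions(H, D, r=3, u=1.0, v=1.0, min_sep=5):
    """Keep ground cells that are local maxima of an assumed score g = u*H + v*D
    within a (2r+1)x(2r+1) neighbourhood, then thin them so that no two kept
    cells are closer than min_sep cells (a simple non-maximum suppression)."""
    g = u * H + v * D
    h_m, w_m = g.shape
    Q = []
    for l in range(h_m):
        for k in range(w_m):
            window = g[max(0, l - r):l + r + 1, max(0, k - r):k + r + 1]
            if g[l, k] > 0 and g[l, k] >= window.max():   # skip empty cells, keep local maxima
                Q.append((k, l))
    Q.sort(key=lambda kl: -g[kl[1], kl[0]])               # strongest candidates first
    Qp = []
    for k, l in Q:
        if all((k - k2) ** 2 + (l - l2) ** 2 >= min_sep ** 2 for k2, l2 in Qp):
            Qp.append((k, l))
    return Qp
```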
It should be noted that, when performing the subdivision of the point cloud (i.e., dividing the sub-point cloud), the sub-point cloud may be in the form of a neighborhood, such as a rectangle, an ellipse, or the like, in addition to a circle region.
S604: and dividing the point cloud into sub-point clouds according to the predicted ground position.
Next, the server may re-segment the point cloud according to the predicted ground positions to obtain sub-point clouds. The point cloud may be re-segmented around each predicted ground position based on the shape (size) of the target object.
Based on the specific implementation in S603, a specific implementation of S604 is described below. For the ground position corresponding to each pixel point (k, l) in the set Q', the point cloud is divided again according to a specified radius r_b (corresponding to the size of the target object): the point cloud elements whose distance from that ground position is smaller than the radius r_b form a sub-point cloud. The resulting sub-point clouds are stored in a set U, i.e. U = {u_1, u_2, …}, where each u_q is a sub-point cloud.
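Re-segmenting the point cloud around each predicted ground position can be sketched as follows; converting a grid cell back to ground coordinates by dividing by β mirrors the assumed rasterization above, and r_b is the radius matched to the target object's size.

```python
import numpy as np

def split_sub_point_clouds(points_w, ground_positions, beta, r_b):
    """For each predicted ground position (grid cell), gather the world-frame
    point cloud elements whose ground-plane distance to that position is below r_b."""
    subs = []
    for k, l in ground_positions:
        gx, gy = k / beta, l / beta          # back from grid cell to ground coordinates
        d2 = (points_w[:, 0] - gx) ** 2 + (points_w[:, 1] - gy) ** 2
        subs.append(points_w[d2 < r_b ** 2])
    return subs
```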
S605: it is determined whether the sub-point cloud corresponds to the target object.
Point cloud data processing in the related art usually analyzes the three-dimensional shape of an object through voxelization, patch (surface) computation, and so on. Because the detection range is large, the distances are long, and target objects may occlude one another, the three-dimensional shape information determined from the point cloud may be incomplete or misleading. In the method provided by the embodiment of the present application, the predicted ground positions where a pedestrian is likely to appear are obtained by rasterizing the point cloud and applying non-maximum suppression. After the predicted ground positions are obtained, the detection algorithm is not run directly; instead, the point cloud is segmented again so that each resulting sub-point cloud does not belong to a connected region containing two or more objects, that is, the sub-point cloud is free from interference by surrounding connected regions, which provides a precondition for detecting the object accurately.
In an actual scene, the camera used to obtain the depth image is usually installed at an angle that captures a top view, so the sub-point cloud is a point cloud under a top-view angle. Therefore, to improve the accuracy of object detection, in one possible implementation, referring to fig. 4, which shows a flowchart of a method for determining whether a sub-point cloud corresponds to a target object provided by an embodiment of the present application, the method for determining whether the sub-point cloud corresponds to the target object in S605 may include:
s701: and determining the forward projection of the sub-point cloud along the direction parallel to the ground according to the shooting angle of the camera.
In the embodiment of the application, the sub-point cloud can be corrected to a head-up view angle, and a forward projection of the sub-point cloud along a direction parallel to the ground is determined.
Based on the specific implementation in S604, a specific implementation of S701 is described below. For each sub-point cloud u_q ∈ U, a reference abscissa can first be computed from the abscissa values x_o of the point cloud elements in u_q (for example, their mean).
Then, with a preset projection window size (w_p, h_p), the point cloud elements of u_q are projected along the direction parallel to the ground onto a vertical plane, and the resulting image is the forward projection J.
It should be noted that, in order to reduce the amount of calculation, a scaling parameter s_p may also be applied to scale the points when obtaining the forward projection of the sub-point cloud.
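A sketch of the forward projection is given below. The exact formulas are omitted in the text, so this version centres the sub-point cloud on the mean abscissa of its elements, scales by s_p, and bins the remaining two coordinates into a w_p × h_p binary image; all of these specifics are assumptions.

```python
import numpy as np

def forward_projection(sub_points, w_p, h_p, s_p=1.0):
    """Project a sub-point cloud onto a vertical plane (head-up view).
    sub_points: world-frame rows (x_w, y_w, z_w); the centring and binning
    choices here are illustrative, not the patent's exact formulas."""
    x = sub_points[:, 0]
    z = sub_points[:, 2]
    x_ref = x.mean()                                        # assumed reference abscissa
    cols = np.round(s_p * (x - x_ref) + w_p / 2).astype(int)
    rows = (h_p - 1 - np.round(s_p * z)).astype(int)        # height above ground mapped upward
    J = np.zeros((h_p, w_p), dtype=np.uint8)
    valid = (cols >= 0) & (cols < w_p) & (rows >= 0) & (rows < h_p)
    J[rows[valid], cols[valid]] = 1
    return J
```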
s702: it is determined whether the forward projection corresponds to the target object.
By executing the method, the sub-point cloud is corrected to the point cloud under the head-up view angle, thereby improving the accuracy of object detection.
It should be noted that the target object to be detected is not limited in the embodiments of the present application, and in a possible implementation manner, the target object to be detected may be a human body.
In a possible implementation manner, if a forward projection of the sub-point cloud along a direction parallel to the ground is determined, referring to fig. 5, which shows a flowchart of a method for determining whether the forward projection corresponds to the target object according to an embodiment of the present application, as shown in fig. 5, the method for determining whether the forward projection corresponds to the target object in S702 may include:
s801: and taking the maximum height value of each column of pixel points in the comprehensive projection as the height value of each column of pixel points.
Wherein the synthetic projection may comprise a forward projection of the entire sub-point cloud. Each column of pixel points in the comprehensive projection can have a corresponding height value (longitudinal coordinate value), and the server can take the maximum height value of each column of pixel points as the height value of the column of pixel points, wherein the maximum height value of each column of pixel points can be the maximum longitudinal coordinate value of the column of pixel points.
In a specific implementation, the synthetic projection may be located in one projection image, the pixel point identification value in the region of the synthetic projection in the projection image is a non-zero value, and the pixel point identification value in the projection image except for the synthetic projection is a zero value.
For convenience of subsequent calculation, the height values of the columns of the synthetic projection can be combined into a vector, denoted H_J. In addition, a detection mark vector M_J can be set for the synthetic projection. The dimension of the detection mark vector, w_J, is the number of columns included in the synthetic projection; the value of each dimension of the detection mark vector identifies whether the corresponding column has been removed from the synthetic projection, and the initial values in the detection mark vector are 0. When the value of a dimension in the detection mark vector is 1, the column of pixel points of the synthetic projection corresponding to that dimension has been removed from the synthetic projection.
S802: and determining the number of each column of pixel points in the comprehensive projection as the length value of each column of pixel points.
In the embodiment of the application, the number of each column of pixel points in the synthetic projection can be determined and used as the length value of each column of pixel points.
In a specific implementation, the number of pixel points with non-zero identification values in each column of the synthetic projection in the projection image can be determined and used as the length value of that column; for convenience of subsequent calculation, the length values of the columns can be combined into a vector, denoted C_J.
S803: and determining the maximum column corresponding to the maximum value obtained by the second discriminant function according to the height value and the length value of each column of pixel points.
The second discrimination function may be obtained by a relationship between a target height value and a target length value when the forward projection corresponds to the human body.
In a particular implementation, the second discriminant function may be a function g(h, d), where h and d are respectively the height value and the length value of a column of pixel points, and u, v may be tuning parameters. The index of the maximum column can then be determined as i_max = argmax{ g(H_J(i), C_J(i)) | 0 ≤ i < w_J }.
s804: determining whether the maximum column is a local height peak column within a first search area centered on the maximum column; if yes, go to step S805.
In an embodiment of the present application, after the maximum column is determined, a first search area may be determined in the synthetic projection centered on the maximum column. For example, a first search area is determined in the synthetic projection by taking the maximum column as the center and the search radius s as the radius, and the first search area is an area from the left side s column of the maximum column to the right side s column of the maximum column in the synthetic projection.
Then, it may be determined whether the height of the maximum column is the local height peak of the first search area, that is, whether the maximum column is the local height peak column within the first search area, if so, S805 may be performed.
It should be noted that, in an actual scene, if the maximum column is a local height peak column in the first search area, it indicates that the pixel point with the maximum height of the maximum column corresponds to the vertex of the human head.
In particular implementations, the index i_loc of the column in the first search area corresponding to the local height peak may be determined as i_loc = argmax{ H_J(i) | i_max − s ≤ i ≤ i_max + s }. If i_max = i_loc, the maximum column is determined to be the local height peak column in the first search area.
S805: and determining the first search area as a human body, updating the comprehensive projection in a first mode, and executing S803 on the updated comprehensive projection, namely determining the maximum column corresponding to the maximum value obtained by the second discriminant function according to the height value and the length value of each column of pixel points.
The first mode may be to remove the first search area from the synthetic projection.
It is to be understood that, if it is determined that the maximum column is not the local height peak column in the first search area, in a possible implementation manner, the method in S804 for determining whether the maximum column is the local height peak column in the first search area centered on the maximum column may further include:
if not, updating the comprehensive projection through a second mode, and executing S803 on the updated comprehensive projection, namely determining the maximum column corresponding to the maximum value obtained through the second discriminant function according to the height value and the length value of each column of pixel points.
The second mode may be to remove the second search area from the synthetic projection, where the second search area is an area centered on the maximum column that spans fewer columns than the first search area.
The determination of the second search area is illustrated based on the aforementioned first search area: centered on the maximum column in the synthetic projection, a second search area is determined with search radius α_m·s, i.e. the area from the column α_m·s columns to the left of the maximum column to the column α_m·s columns to the right of the maximum column, where α_m is a fixed parameter and α_m < 1.
Therefore, the projection part which is detected in the comprehensive projection can be removed, repeated calculation is avoided, the calculation amount is reduced, and the detection efficiency is improved.
In this embodiment of the application, in order to improve accuracy of human body detection, in a possible implementation manner, the method may further include:
s901: and determining the maximum length value according to the length value of each column of pixel points in the comprehensive projection.
In the embodiment of the present application, a maximum length value of the synthetic projection may be determined.
In a specific implementation, the maximum length value of the synthetic projection can be determined by the following formula: c_max = max{ C_J(i) | 0 ≤ i < w_J }.
Then, the method for determining whether the maximum column is the local height peak column in the first search area centered on the maximum column in S804 may include:
s902: it is determined whether the maximum column is a local height peak column and the maximum length value meets a trustworthiness condition within a first search area centered on the maximum column.
Wherein the credibility condition is determined according to a relation between a target maximum length value when the forward projection corresponds to the human body and the length of the target maximum column.
In a particular implementation, the credibility condition may be that the length value of the maximum column is not less than a first parameter times the maximum length value, i.e. C_J(i_max) ≥ α_h·c_max, where α_h is the first parameter. Alternatively, the credibility condition may be that the length value of the maximum column is not less than a second parameter times the height value of the maximum column, i.e. C_J(i_max) ≥ β_h·H_J(i_max), where β_h is the second parameter.
In the embodiment of the present application, it is determined whether the maximum length value of the synthetic projection meets the credibility condition, in addition to determining whether the maximum column is the local height peak column in the first search area centered on the maximum column in step S804.
Next, the human body detection method is described. Referring to fig. 6, which shows a flowchart of a human body detection method provided in an embodiment of the present application, first, the maximum column of the synthetic projection may be determined. It may then be determined whether the height value of the maximum column is a local peak within the first search area. If so, it may be determined whether the maximum length value of the synthetic projection meets the credibility condition. If the credibility condition is met, the values in the detection mark vector corresponding to the columns in the first search area (the columns with indices in [i_max − s, i_max + s]) are set to 1, so that the first search area is removed from the synthetic projection, i.e. the synthetic projection is updated in the first mode; the first search area is determined to be a human body and the human body detection result is stored. If it is determined that the height value of the maximum column is not a local peak in the first search area, or if it is determined that the maximum length value of the synthetic projection does not meet the credibility condition, the values in the detection mark vector corresponding to the columns in the second search area (the columns with indices in [i_max − α_m·s, i_max + α_m·s]) are set to 1, so that the second search area is removed from the synthetic projection, i.e. the synthetic projection is updated in the second mode. If all elements in M_J are 1, the process can end; otherwise, the maximum column in the synthetic projection is determined again.
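The iterative search over the synthetic projection shown in fig. 6 can be sketched as follows. The per-column statistics follow the description above, but the scoring function (here H_J + C_J), the credibility test, and the thresholds are placeholders rather than the patent's exact choices.

```python
import numpy as np

def detect_humans_in_projection(J, s=8, alpha_m=0.5, alpha_h=0.5):
    """Iterative column-peak search over a synthetic forward projection J (binary image).
    H_J holds per-column maximum height, C_J per-column pixel counts, M_J the
    removed-column marks; the score and thresholds are illustrative placeholders."""
    h_p, w_j = J.shape
    rows = np.arange(h_p)[:, None]
    H_J = (J * (h_p - rows)).max(axis=0)          # column height measured from the bottom
    C_J = J.sum(axis=0)                           # column length (pixel count)
    M_J = np.zeros(w_j, dtype=bool)               # detection mark vector (True = removed)
    c_max = C_J.max()
    detections = []
    while not M_J.all():
        score = np.where(M_J, -np.inf, H_J + C_J)  # assumed second discriminant function
        i_max = int(score.argmax())
        lo, hi = max(0, i_max - s), min(w_j, i_max + s + 1)
        i_loc = lo + int(H_J[lo:hi].argmax())      # local height peak in the first search area
        if i_loc == i_max and C_J[i_max] >= alpha_h * c_max:
            detections.append(i_max)               # head-top column of one detected person
            M_J[lo:hi] = True                      # remove the first search area
        else:
            m = int(alpha_m * s)
            M_J[max(0, i_max - m):min(w_j, i_max + m + 1)] = True  # remove the second search area
    return detections
```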
In addition, depending on the quality of the depth image, the local peak algorithm can be replaced with another human body detector, with the column of its detection result taken as the peak i_max.
Considering that, in the related art, the loss of detail caused by depth map quality means that local texture features may fail to describe the human body, applying the two statistical features of height value and density value provides robustness against changes in human body posture and against human bodies approaching one another during human body detection, and improves the accuracy of human body detection.
Based on the object detection method, an embodiment of the present application further provides an object detection apparatus, as shown in fig. 7, which shows a structural diagram of the object detection apparatus provided in the embodiment of the present application, and as shown in fig. 7, the apparatus includes:
an obtaining unit 701, configured to obtain a depth image to be detected;
a segmentation unit 702, configured to segment a foreground region from the depth image according to depth values of pixel points in the depth image;
a generating unit 703, configured to generate a point cloud corresponding to the foreground region;
a detecting unit 704, configured to detect whether a target object is included in the depth image according to the point cloud.
In a possible implementation manner, the segmentation unit 702 is specifically configured to:
the pixel point in the depth image corresponds to a background probability condition, and whether the depth value of a target pixel point meets the corresponding background probability condition or not is determined for the target pixel point in the depth image, wherein the target pixel point is any one pixel point in the depth image;
if not, determining the target pixel point as a foreground pixel point belonging to the foreground area.
In a possible implementation manner, the apparatus further includes an updating unit, configured to:
after detection of a continuous T1 frame depth image in a video is completed, for a first set including pixel points at a target position in the continuous T1 frame depth image, if the number of foreground pixel points in the first set is not less than a first number threshold, updating a background probability condition corresponding to the pixel points at the target position to a foreground probability condition, where the foreground probability condition is used to determine whether the pixel points at the target position in the depth image after the continuous T1 frame depth image are foreground pixel points belonging to a foreground region, and the foreground probability condition is determined according to depth values of the pixel points at the target position in the continuous T1 frame depth image.
In a possible implementation manner, the updating unit is further configured to:
if the pixel point at the target position corresponds to a foreground probability condition, after detection of a continuous T2 frame depth image in a video is completed, updating a foreground probability condition corresponding to the pixel point at the target position to a background probability condition for a second set including the pixel point at the target position in the continuous T2 frame depth image, if the number of the background pixel points in the second set is not less than a second number threshold, where the background probability condition is used to determine whether the pixel point at the target position in the depth image after the continuous T2 frame depth image is a foreground pixel point belonging to a foreground region, and the new background probability condition is determined according to the depth value of the pixel point at the target position in the continuous T2 frame depth image.
In a possible implementation manner, the updating unit is further configured to:
after detection of T3 consecutive frames of depth images in the video is completed, for a third set including the pixel points at a target position in the T3 consecutive frames of depth images, if the number of pixel points with zero depth values in the third set is not less than a third number threshold, exclude the pixel point at the target position from the target pixel points in subsequent depth images.
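As an informal sketch of the three update rules above, the snippet below processes one pixel position after a window of consecutive frames has been detected. The record layout (`history`), the way hits are counted against the thresholds `t1_thresh`, `t2_thresh` and `t3_thresh`, and the (mean, tolerance) form of the newly built condition are assumptions; the patent only fixes the counting criteria and the fact that the new condition is derived from the window's depth values.

```python
import numpy as np

def update_pixel_condition(history, state, t1_thresh, t2_thresh, t3_thresh):
    """Per-position condition update, run once a window of consecutive frames is done.

    `history` is a hypothetical per-position record holding the depth values of the
    window and boolean flags saying whether the position was classified as foreground
    or background in each frame. `state` is "background" or "foreground", naming the
    kind of condition currently attached to the position.
    """
    depths = np.asarray(history["depths"], dtype=float)

    # T3 rule: too many zero-depth observations -> stop treating this position
    # as a target pixel point at all.
    if np.count_nonzero(depths == 0) >= t3_thresh:
        return "excluded", None

    nonzero = depths[depths > 0]
    if nonzero.size == 0:
        return state, None                                   # nothing usable in this window
    new_condition = (nonzero.mean(), 3.0 * nonzero.std() + 1e-6)  # assumed (centre, tolerance) form

    if state == "background":
        # T1 rule: enough foreground hits -> switch to a foreground probability
        # condition built from the window's depth values.
        if np.count_nonzero(history["was_foreground"]) >= t1_thresh:
            return "foreground", new_condition
    else:
        # T2 rule: enough background hits -> switch back to a new background
        # probability condition built from the window's depth values.
        if np.count_nonzero(history["was_background"]) >= t2_thresh:
            return "background", new_condition

    return state, None                                       # keep the current condition
```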
In a possible implementation manner, the detecting unit 704 is specifically configured to:
determining world coordinates of point cloud elements in the point cloud;
generating a height map and a density map according to the world coordinates, wherein pixel points in the height map and the density map correspond to ground positions obtained by projecting the point cloud into a ground coordinate system; a pixel point in the height map identifies the maximum height value of the corresponding ground position, the maximum height value being the maximum of the heights of the point cloud elements projected to that ground position; and a pixel point in the density map identifies the number of point cloud elements projected to the corresponding ground position;
determining a predicted ground position of the target object according to the height map and the density map;
partitioning the point cloud into sub-point clouds according to the predicted ground location;
determining whether the sub-point cloud corresponds to a target object.
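The detection pipeline above can be illustrated with the following sketch, which builds the height map and the density map from the world coordinates of the foreground point cloud. The grid resolution and extent are placeholder values, not parameters taken from the embodiments.

```python
import numpy as np

def height_and_density_maps(points, cell=0.05, grid=(200, 200)):
    """Project an (N, 3) point cloud (z = height above ground) onto a ground grid.

    `cell` (metres per grid cell) and `grid` are illustrative parameters.
    """
    ix = np.clip(((points[:, 0] - points[:, 0].min()) / cell).astype(int), 0, grid[0] - 1)
    iy = np.clip(((points[:, 1] - points[:, 1].min()) / cell).astype(int), 0, grid[1] - 1)

    height_map = np.zeros(grid)     # maximum height of the points projected to each ground cell
    density_map = np.zeros(grid)    # number of point cloud elements projected to each ground cell

    np.maximum.at(height_map, (ix, iy), points[:, 2])
    np.add.at(density_map, (ix, iy), 1)
    return height_map, density_map
```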
In a possible implementation manner, the detecting unit 704 is further specifically configured to:
determining the predicted ground position of the target object through a first discriminant function according to the height map and the density map, wherein the first discriminant function is determined according to a relation between a target height and a target density obtained by projecting the point cloud corresponding to a target object onto the ground.
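Since the concrete form of the first discriminant function is not given, the sketch below simply combines the height map and the density map with placeholder weights and picks local maxima of the resulting score as predicted ground positions. It is an assumed instantiation of the height/density relation, not the claimed function; the weights, threshold and suppression radius are illustrative.

```python
import numpy as np

def predict_ground_positions(height_map, density_map,
                             w_h=1.0, w_d=0.02, score_thresh=1.5, nms_radius=4):
    """Illustrative first discriminant: weighted height/density score + greedy peak picking."""
    score = w_h * height_map + w_d * density_map
    positions, work = [], score.copy()
    while True:
        idx = np.unravel_index(np.argmax(work), work.shape)
        if work[idx] <= score_thresh:          # nothing above the (assumed) threshold remains
            break
        positions.append(idx)                  # predicted ground position (grid cell)
        r0, c0 = idx
        work[max(0, r0 - nms_radius):r0 + nms_radius + 1,
             max(0, c0 - nms_radius):c0 + nms_radius + 1] = 0.0   # suppress the neighbourhood
    return positions
```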
In a possible implementation manner, the detecting unit 704 is further specifically configured to determine a forward projection of the sub-point cloud along a direction parallel to the ground according to a shooting angle of a camera;
determining whether the forward projection corresponds to a target object.
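Purely as an illustration of the forward-projection step, the snippet below projects a sub-point cloud onto a vertical plane along a viewing direction parallel to the ground. The way the direction is derived from the shooting angle (here it is passed in directly as its ground-plane component `view_dir_xy`) and the pixel size are assumptions, not details given by the patent.

```python
import numpy as np

def forward_projection(sub_points, view_dir_xy, cell=0.05):
    """Front-view projection of an (N, 3) sub-point cloud along a ground-parallel direction."""
    d = np.asarray(view_dir_xy, dtype=float)
    d /= np.linalg.norm(d)                      # unit projection direction on the ground
    lateral = np.array([-d[1], d[0]])           # ground direction perpendicular to the view

    u = sub_points[:, :2] @ lateral             # horizontal coordinate in the projection plane
    v = sub_points[:, 2]                        # vertical coordinate = height above ground

    cols = ((u - u.min()) / cell).astype(int)
    rows = ((v.max() - v) / cell).astype(int)
    image = np.zeros((rows.max() + 1, cols.max() + 1), dtype=np.uint16)
    np.add.at(image, (rows, cols), 1)           # count of points falling into each projection pixel
    return image
```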
In one possible implementation, the target object is a human body.
In a possible implementation manner, the detecting unit 704 is further specifically configured to:
if the forward projection of the sub-point cloud along the direction parallel to the ground has been determined, taking the maximum height value of each column of pixel points in a comprehensive projection as the height value of that column, wherein the comprehensive projection comprises the forward projection of the sub-point cloud;
determining the number of pixel points in each column of the comprehensive projection as the length value of that column;
determining, according to the height value and the length value of each column of pixel points, the maximum column corresponding to the maximum value obtained through a second discriminant function, wherein the second discriminant function is obtained from the relation between the target height value and the target length value when the forward projection corresponds to a human body;
determining whether the maximum column is a local height peak column within a first search area centered on the maximum column;
if so, determining the first search area as a human body, updating the comprehensive projection in a first mode, and executing, on the updated comprehensive projection, the step of obtaining the maximum column corresponding to the maximum value through the second discriminant function according to the height value and the length value of each column of pixel points; the first mode is to eliminate the first search area from the comprehensive projection.
In a possible implementation manner, the detecting unit 704 is further specifically configured to:
if not, updating the comprehensive projection in a second mode, and executing, on the updated comprehensive projection, the step of obtaining the maximum column corresponding to the maximum value through the second discriminant function according to the height value and the length value of each column of pixel points; the second mode is to eliminate a second search area from the comprehensive projection, the second search area being an area smaller than the first search area and centered on the maximum column.
In a possible implementation manner, the detecting unit 704 is further specifically configured to:
determining a maximum length value according to the length value of each column of pixel points in the comprehensive projection;
then, the determining whether the maximum column is a local height peak column within a first search area centered on the maximum column comprises:
determining whether the maximum column is a local height peak column within a first search area centered on the maximum column and whether the maximum length value meets a credibility condition, wherein the credibility condition is determined according to a relation between the target maximum length value and the length value of the target maximum column when the forward projection corresponds to a human body.
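The column-wise search described in the last few implementation manners can be sketched as follows. The concrete second discriminant function, the sizes of the first and second search areas, and the credibility test are placeholder choices; only the overall flow (pick the maximum column, test the local height peak and the credibility condition, then eliminate either the first or the smaller second search area and repeat) follows the description above.

```python
import numpy as np

def detect_humans_in_projection(col_height, col_length,
                                score_fn=lambda h, l: h + 0.1 * l,
                                first_radius=20, second_radius=5,
                                min_score=1.0, max_length_ratio=4.0):
    """Iterative column search over the comprehensive projection (illustrative parameters)."""
    height = col_height.astype(float).copy()     # maximum height value per column
    length = col_length.astype(float).copy()     # number of pixel points per column
    detections = []

    while True:
        score = score_fn(height, length)                 # assumed second discriminant function
        c = int(np.argmax(score))                        # maximum column
        if score[c] <= min_score:                        # nothing human-like remains
            break

        lo, hi = max(0, c - first_radius), min(len(height), c + first_radius + 1)
        is_peak = height[c] >= height[lo:hi].max()       # local height peak in the first search area
        credible = length.max() <= max_length_ratio * max(length[c], 1.0)   # assumed credibility test

        if is_peak and credible:
            detections.append((lo, hi))                  # the first search area is taken as one human body
            height[lo:hi] = 0.0                          # first mode: eliminate the first search area
            length[lo:hi] = 0.0
        else:
            lo2, hi2 = max(0, c - second_radius), min(len(height), c + second_radius + 1)
            height[lo2:hi2] = 0.0                        # second mode: eliminate the smaller second search area
            length[lo2:hi2] = 0.0
    return detections
```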
According to the above technical solution, after the depth image to be detected is obtained, the foreground region can be segmented from the depth image according to the depth values of the pixel points in the depth image. Since the target object is more likely to be located in the foreground region of the depth image, the corresponding point cloud may be generated only for the foreground region, and whether the target object is included in the depth image is then detected according to the generated point cloud. In this method, a point cloud is generated and further detected only for the foreground region that is more likely to include the target object, while the background regions that are less likely to include the target object are removed, that is, no point cloud is generated for those regions, so that the amount of calculation is reduced and the efficiency and real-time performance of object detection are improved.
An embodiment of the present application further provides an apparatus for object detection, where the apparatus includes a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute the above object detection method according to instructions in the program code.
An embodiment of the present application further provides a computer-readable storage medium, where the computer-readable storage medium is used for storing a program code, and the program code is used for executing the object detection method.
As can be seen from the above description of the embodiments, those skilled in the art can clearly understand that all or part of the steps in the above embodiment methods can be implemented by software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network communication device such as a media gateway, etc.) to execute the method according to the embodiments or some parts of the embodiments of the present application.
It should be noted that, in the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
It is further noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (13)
1. An object detection method, characterized in that the method comprises:
acquiring a depth image to be detected;
segmenting a foreground region from the depth image according to the depth value of a pixel point in the depth image;
generating a point cloud corresponding to the foreground area;
and detecting whether a target object is included in the depth image or not according to the point cloud.
2. The method of claim 1, wherein the pixel points in the depth image have a background probability condition, and for a target pixel point in the depth image, the segmenting the foreground region from the depth image according to the depth values of the pixel points in the depth image comprises:
determining whether the depth value of the target pixel point meets a corresponding background probability condition, wherein the target pixel point is any one pixel point in the depth image;
if not, determining the target pixel point as a foreground pixel point belonging to the foreground area.
3. The method of claim 2, further comprising:
after detection of T1 consecutive frames of depth images in a video is completed, for a first set including the pixel points at a target position in the T1 consecutive frames of depth images, if the number of foreground pixel points in the first set is not less than a first number threshold, updating the background probability condition corresponding to the pixel point at the target position to a foreground probability condition, where the foreground probability condition is used to determine whether the pixel point at the target position in depth images subsequent to the T1 consecutive frames is a foreground pixel point belonging to the foreground region, and the foreground probability condition is determined according to the depth values of the pixel points at the target position in the T1 consecutive frames of depth images.
4. The method of claim 3, wherein if the pixel point at the target location corresponds to a foreground probability condition, the method further comprises:
after detection of T2 consecutive frames of depth images in the video is completed, for a second set including the pixel points at the target position in the T2 consecutive frames of depth images, if the number of background pixel points in the second set is not less than a second number threshold, updating the foreground probability condition corresponding to the pixel point at the target position to a new background probability condition, where the new background probability condition is used to determine whether the pixel point at the target position in depth images subsequent to the T2 consecutive frames is a foreground pixel point belonging to the foreground region, and the new background probability condition is determined according to the depth values of the pixel points at the target position in the T2 consecutive frames of depth images.
5. The method of claim 2, further comprising:
after detection of T3 consecutive frames of depth images in the video is completed, for a third set including the pixel points at a target position in the T3 consecutive frames of depth images, if the number of pixel points with zero depth values in the third set is not less than a third number threshold, the pixel points at the target position in subsequent depth images are not included in the target pixel points.
6. The method of claim 1, wherein the detecting whether a target object is included in the depth image from the point cloud comprises:
determining world coordinates of point cloud elements in the point cloud;
generating a height map and a density map according to the world coordinates, wherein pixel points in the height map and the density map correspond to ground positions obtained by projecting the point cloud into a ground coordinate system; a pixel point in the height map identifies the maximum height value of the corresponding ground position, the maximum height value being the maximum of the heights of the point cloud elements projected to that ground position; and a pixel point in the density map identifies the number of point cloud elements projected to the corresponding ground position;
determining a predicted ground position of the target object according to the height map and the density map;
partitioning the point cloud into sub-point clouds according to the predicted ground location;
determining whether the sub-point cloud corresponds to a target object.
7. The method of claim 6, wherein determining a predicted ground location of a target object from the height map and the density map comprises:
determining the predicted ground position of the target object through a first discriminant function according to the height map and the density map, wherein the first discriminant function is determined according to a relation between a target height and a target density obtained by projecting the point cloud corresponding to a target object onto the ground.
8. The method of claim 6, wherein the determining whether the sub-point cloud corresponds to a target object comprises:
determining the forward projection of the sub-point cloud along the direction parallel to the ground according to the shooting angle of a camera;
determining whether the forward projection corresponds to a target object.
9. The method of claim 8, wherein the target object is a human body, and if the forward projection of the sub-point cloud in the direction parallel to the ground has been determined, the determining whether the forward projection corresponds to the target object comprises:
taking the maximum height value of each column of pixel points in a comprehensive projection as the height value of that column, wherein the comprehensive projection comprises the forward projection of the sub-point cloud;
determining the number of pixel points in each column of the comprehensive projection as the length value of that column;
determining, according to the height value and the length value of each column of pixel points, the maximum column corresponding to the maximum value obtained through a second discriminant function, wherein the second discriminant function is obtained from the relation between the target height value and the target length value when the forward projection corresponds to a human body;
determining whether the maximum column is a local height peak column within a first search area centered on the maximum column; and
if so, determining the first search area as a human body, updating the comprehensive projection in a first mode, and executing, on the updated comprehensive projection, the step of obtaining the maximum column corresponding to the maximum value through the second discriminant function according to the height value and the length value of each column of pixel points, wherein the first mode is to eliminate the first search area from the comprehensive projection.
10. The method of claim 9, wherein after the determining whether the maximum column is a local height peak column within a first search area centered on the maximum column, the method further comprises:
if not, updating the comprehensive projection in a second mode, and executing, on the updated comprehensive projection, the step of obtaining the maximum column corresponding to the maximum value through the second discriminant function according to the height value and the length value of each column of pixel points, wherein the second mode is to eliminate a second search area from the comprehensive projection, the second search area being an area smaller than the first search area and centered on the maximum column.
11. The method according to claim 9 or 10, characterized in that the method further comprises:
determining a maximum length value according to the length value of each column of pixel points in the comprehensive projection;
then, the determining whether the maximum column is a local height peak column within a first search area centered on the maximum column comprises:
determining whether the maximum column is a local height peak column within a first search area centered on the maximum column and whether the maximum length value meets a credibility condition, wherein the credibility condition is determined according to a relation between the target maximum length value and the length value of the target maximum column when the forward projection corresponds to a human body.
12. An object detection apparatus, characterized in that the apparatus comprises:
an obtaining unit, configured to obtain a depth image to be detected;
a segmentation unit, configured to segment a foreground region from the depth image according to depth values of pixel points in the depth image;
a generating unit, configured to generate a point cloud corresponding to the foreground region; and
a detecting unit, configured to detect whether a target object is included in the depth image according to the point cloud.
13. An apparatus for object detection, the apparatus comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the object detection method according to any one of claims 1 to 11 according to instructions in the program code.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911175243.8A CN111144213B (en) | 2019-11-26 | 2019-11-26 | Object detection method and related equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111144213A true CN111144213A (en) | 2020-05-12 |
CN111144213B CN111144213B (en) | 2023-08-18 |
Family
ID=70516667
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911175243.8A Active CN111144213B (en) | 2019-11-26 | 2019-11-26 | Object detection method and related equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111144213B (en) |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104156937A (en) * | 2013-05-15 | 2014-11-19 | 株式会社理光 | Shadow detection method and device |
CN105096300A (en) * | 2014-05-08 | 2015-11-25 | 株式会社理光 | Object detecting method and device |
CN104361577A (en) * | 2014-10-20 | 2015-02-18 | 湖南戍融智能科技有限公司 | Foreground detection method based on fusion of depth image and visible image |
CN104751146A (en) * | 2015-04-13 | 2015-07-01 | 中国科学技术大学 | Indoor human body detection method based on 3D (three-dimensional) point cloud image |
CN106526610A (en) * | 2016-11-04 | 2017-03-22 | 广东电网有限责任公司电力科学研究院 | Power tower automatic positioning method and apparatus based on unmanned aerial vehicle laser point cloud |
CN109658441A (en) * | 2018-12-14 | 2019-04-19 | 四川长虹电器股份有限公司 | Foreground detection method and device based on depth information |
CN110111414A (en) * | 2019-04-10 | 2019-08-09 | 北京建筑大学 | A kind of orthography generation method based on three-dimensional laser point cloud |
CN110321826A (en) * | 2019-06-26 | 2019-10-11 | 贵州省交通规划勘察设计研究院股份有限公司 | A kind of unmanned plane side slope vegetation classification method based on plant height |
Non-Patent Citations (6)
Title |
---|
CHEN X等: "3d object proposals using stereo imagery for accurate object class detection", 《IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE》 * |
CHEN X等: "Multi-view 3d object detection network for autonomous driving", 《PROCEEDINGS OF THE IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION》 * |
MA, L等: "Mobile Laser Scanned Point-Clouds for Road Object Detection and Extraction: A Review", 《REMOTE SENS》 * |
PULITI S等: "Assessing 3D point clouds from aerial photographs for species-specific forest inventories", 《SCANDINAVIAN JOURNAL OF FOREST RESEARCH》 * |
张银等: "三维激光雷达在无人车环境感知中的应用研究", 《激光与光电子学进展》 * |
李永强等: "车载激光扫描数据中杆状地物提取", 《测绘科学》 * |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021244364A1 (en) * | 2020-06-03 | 2021-12-09 | 苏宁易购集团股份有限公司 | Pedestrian detection method and device based on depth images |
CN112001298B (en) * | 2020-08-20 | 2021-09-21 | 佳都科技集团股份有限公司 | Pedestrian detection method, device, electronic equipment and storage medium |
CN112001298A (en) * | 2020-08-20 | 2020-11-27 | 佳都新太科技股份有限公司 | Pedestrian detection method, device, electronic equipment and storage medium |
WO2022111682A1 (en) * | 2020-11-30 | 2022-06-02 | 深圳市普渡科技有限公司 | Moving pedestrian detection method, electronic device and robot |
CN113657303A (en) * | 2021-08-20 | 2021-11-16 | 北京千丁互联科技有限公司 | Room structure identification method and device, terminal device and readable storage medium |
CN113657303B (en) * | 2021-08-20 | 2024-04-23 | 北京千丁互联科技有限公司 | Room structure identification method, device, terminal equipment and readable storage medium |
CN113965695A (en) * | 2021-09-07 | 2022-01-21 | 福建库克智能科技有限公司 | Method, system, device, display unit and medium for image display |
CN113989276B (en) * | 2021-12-23 | 2022-03-29 | 珠海视熙科技有限公司 | Detection method and detection device based on depth image and camera equipment |
CN113989276A (en) * | 2021-12-23 | 2022-01-28 | 珠海视熙科技有限公司 | Detection method and detection device based on depth image and camera equipment |
CN115115655A (en) * | 2022-06-17 | 2022-09-27 | 重庆长安汽车股份有限公司 | Object segmentation method, device, electronic device, storage medium and program product |
CN115623318A (en) * | 2022-12-20 | 2023-01-17 | 荣耀终端有限公司 | Focusing method and related device |
CN115623318B (en) * | 2022-12-20 | 2024-04-19 | 荣耀终端有限公司 | Focusing method and related device |
CN118314358A (en) * | 2024-03-08 | 2024-07-09 | 钛玛科(北京)工业科技有限公司 | Visual measurement system for rubber curling |
Also Published As
Publication number | Publication date |
---|---|
CN111144213B (en) | 2023-08-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111144213B (en) | Object detection method and related equipment | |
CN110675418B (en) | Target track optimization method based on DS evidence theory | |
CN110807809B (en) | Light-weight monocular vision positioning method based on point-line characteristics and depth filter | |
CN106940704B (en) | Positioning method and device based on grid map | |
JP5487298B2 (en) | 3D image generation | |
CN112258600A (en) | Simultaneous positioning and map construction method based on vision and laser radar | |
CN110689562A (en) | Trajectory loop detection optimization method based on generation of countermeasure network | |
CN109472820B (en) | Monocular RGB-D camera real-time face reconstruction method and device | |
CN105069804B (en) | Threedimensional model scan rebuilding method based on smart mobile phone | |
CN107274483A (en) | A kind of object dimensional model building method | |
CN111160291B (en) | Human eye detection method based on depth information and CNN | |
CN114140527B (en) | Dynamic environment binocular vision SLAM method based on semantic segmentation | |
CN111369495A (en) | Video-based panoramic image change detection method | |
CN115376109B (en) | Obstacle detection method, obstacle detection device, and storage medium | |
CN117593650B (en) | Moving point filtering vision SLAM method based on 4D millimeter wave radar and SAM image segmentation | |
CN114782628A (en) | Indoor real-time three-dimensional reconstruction method based on depth camera | |
CN116468786B (en) | Semantic SLAM method based on point-line combination and oriented to dynamic environment | |
CN111178193A (en) | Lane line detection method, lane line detection device and computer-readable storage medium | |
CN114608522B (en) | Obstacle recognition and distance measurement method based on vision | |
CN110428461B (en) | Monocular SLAM method and device combined with deep learning | |
CN106709432B (en) | Human head detection counting method based on binocular stereo vision | |
CN103077536B (en) | Space-time mutative scale moving target detecting method | |
CN112446355B (en) | Pedestrian recognition method and people stream statistics system in public place | |
CN117726747A (en) | Three-dimensional reconstruction method, device, storage medium and equipment for complementing weak texture scene | |
CN107274477B (en) | Background modeling method based on three-dimensional space surface layer |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |