CN108875504B - Image detection method and image detection device based on neural network - Google Patents


Info

Publication number
CN108875504B
Authority
CN
China
Prior art keywords
human body
region
human
image
head
Prior art date
Legal status
Active
Application number
CN201711107369.2A
Other languages
Chinese (zh)
Other versions
CN108875504A (en)
Inventor
林孟潇
张祥雨
Current Assignee
Beijing Kuangshi Technology Co Ltd
Beijing Megvii Technology Co Ltd
Original Assignee
Beijing Kuangshi Technology Co Ltd
Beijing Megvii Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Kuangshi Technology Co Ltd, Beijing Megvii Technology Co Ltd filed Critical Beijing Kuangshi Technology Co Ltd
Priority to CN201711107369.2A
Publication of CN108875504A
Application granted
Publication of CN108875504B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53Recognition of crowd images, e.g. recognition of crowd congestion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/50Extraction of image or video features by performing operations within image blocks; by using histograms, e.g. histogram of oriented gradients [HoG]; by summing image-intensity values; Projection analysis

Abstract

The embodiments of the present disclosure provide a neural-network-based image detection method and image detection device. The method comprises the following steps: performing feature extraction on an image to obtain image features; detecting a human head region of a human body in the image based on the image features; and determining a human body region corresponding to the human head region in the image based on the detection result of the human head region, the human body region comprising both the human head region and the remainder of the human body. Compared with conventional detection devices, the detection speed and detection efficiency are improved.

Description

Image detection method and image detection device based on neural network
Technical Field
The embodiment of the disclosure relates to an image detection method based on a neural network and an image detection device corresponding to the method.
Background
Pedestrian Detection is a technique that determines whether a pedestrian is present in an image or video sequence and, if so, gives its precise location. Because a pedestrian combines both rigid and non-rigid characteristics, and a pedestrian's appearance is easily affected by clothing, scale, occlusion, pose, viewing angle, and so on, pedestrian detection is very challenging.
When people are dense and overlap one another, existing methods produce a large number of missed detections, or treat several people as a single person, because algorithms are needed to eliminate possible overlaps; this is particularly harmful to tasks such as people-flow statistics. On the other hand, when few people appear in the picture, existing methods still perform unnecessary computation over large areas of the picture, wasting system resources and reducing computational efficiency.
Disclosure of Invention
An object of the embodiments of the present invention is to provide an image detection method and an image detection apparatus based on a neural network, so as to solve the above technical problems.
According to at least one embodiment of the present disclosure, there is provided a neural-network-based image detection method, the method including: performing feature extraction on an image to obtain image features; detecting a human head region of a human body in the image based on the image features; and determining a human body region corresponding to the human head region in the image based on the detection result of the human head region, the human body region comprising both the human head region and the remainder of the human body.
For example, the step of detecting the human head region of the human body in the image based on the image feature includes: inputting the image into a first neural network, wherein the first neural network is used for extracting a human head area in the image; outputting at least one head region candidate box in the image from the first neural network.
For example, the step of detecting the human head region of the human body in the image based on the image features further comprises: outputting, from the first neural network, a score for each of the at least one head region candidate box of the image, the score representing the likelihood that the region is a head region; comparing the score with a preset head score threshold; and determining each head region candidate box whose score is greater than the preset head score threshold as a head region.
For example, the step of determining a human body region in the image corresponding to the human head region based on the detection result of the human head region includes: acquiring relative position parameters of a human head and a human body obtained based on machine learning; and determining at least one human body region candidate frame corresponding to the human head region in the image according to the relative position parameter.
For example, the step of determining a human body region corresponding to the human head region in the image based on the detection result of the human head region further includes: acquiring a preset estimation parameter, wherein the preset estimation parameter represents the number of at least one human body region candidate frame corresponding to the human head region in the image; and determining the number of human body region candidate frames corresponding to the human head region in the image based on the preset estimation parameters.
For example, the step of determining a human body region corresponding to the human head region in the image based on the detection result of the human head region further includes: determining a score value for each human body region candidate box, the score value representing a likelihood that the extracted human body region candidate box is a human body region; selecting at least one of the number of human body region candidate boxes as the human body region based on the score value.
For example, the step of determining the score value of each human body region candidate box includes: inputting the number of human body region candidate frame images into a trained second neural network, wherein the second neural network is used for determining the score value of each human body region candidate frame image; and outputting the score value corresponding to each human body region candidate box from the second neural network.
For example, the step of determining a human body region corresponding to the human head region in the image based on the detection result of the human head region further includes: and correcting the human body area candidate frame to obtain a corrected human body area.
For example, the step of correcting the human body region candidate frame includes: inputting the human body region candidate frame image into the third neural network, wherein the third neural network is used for correcting the human body region candidate frame; and outputting the correction result of each human body region candidate frame from the third neural network.
For example, the step of outputting the correction result of each of the human body region candidate boxes from the third neural network includes: the third neural network determines an original region of the human body region candidate box in the image based on the human body region candidate box; determining whether the original region is complete; and when the original area is incomplete, correcting the human body area candidate frame corresponding to the original area.
For example, when the original region is incomplete, the step of correcting the human body region candidate frame corresponding to the original region includes: acquiring a plurality of standard body frames output by the trained third neural network; matching the human body region candidate frame with the plurality of standard body frames; taking the standard body frame with the highest matching rate as a correction frame corresponding to the human body region candidate frame; and correcting the corresponding human body region candidate frame based on the correction frame.
For example, the modified parameters include: the position of the center point of the region, the width of the region and the height of the region.
For example, the method further comprises: and carrying out non-maximum suppression post-processing on the human head area and the human body area corresponding to the human head area in the image so as to obtain the human body detected in the image.
For example, the step of performing non-maximum suppression post-processing on the human head region and a human body region corresponding to the human head region in the image to obtain a human body detected in the image includes: determining the score value of each of a plurality of human body regions corresponding to the human head region when there are a plurality of human body regions corresponding to any one of the human head regions; and determining the human body area with the highest score value as the human body area corresponding to the human head area.
For example, the plurality of head regions include a first head region and a second head region, and the step of performing non-maximum suppression post-processing on the head region and a body region corresponding to the head region in the image to obtain the body detected in the image further includes: when the first head area and the second head area are overlapped, determining the ratio of the overlapped area to the union area of the first head area and the second head area; when the ratio is larger than a first ratio threshold, acquiring a first human body area corresponding to the first human head area and a second human body area corresponding to the second human head area; and comparing the score values of the first human body region and the second human body region, and when the score value of the first human body region is larger than the score value of the second human body region, determining the first human body region and a first human head region corresponding to the first human body region as the finally detected human body region and human head region.
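The overlap test and score comparison described above amount to a head-level non-maximum suppression. A minimal sketch follows; the (x1, y1, x2, y2) box format, the detection tuple layout, and the 0.5 threshold are illustrative assumptions, not values fixed by this disclosure.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def head_nms(detections, iou_threshold=0.5):
    """detections: list of (head_box, body_box, body_score) tuples.
    When two head boxes overlap beyond the threshold, keep only the
    detection whose body region has the higher score."""
    kept = []
    for det in sorted(detections, key=lambda d: d[2], reverse=True):
        if all(iou(det[0], k[0]) <= iou_threshold for k in kept):
            kept.append(det)
    return kept

dets = [((0, 0, 10, 10), (0, 0, 10, 30), 0.9),
        ((1, 1, 11, 11), (1, 1, 11, 31), 0.8)]
print(len(head_nms(dets)))  # 1: the overlapping, lower-scoring head is suppressed
```

Here the ratio of the intersection area to the union area plays the role of the first ratio threshold comparison described above.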
According to at least one embodiment of the present disclosure, there is provided a neural-network-based image detection apparatus including a memory and a processor, the memory storing program instructions, and the processor, when executing the program instructions, performing: feature extraction on an image to obtain image features; detecting a human head region of a human body in the image based on the image features; and determining a human body region corresponding to the human head region in the image based on the detection result of the human head region, the human body region comprising both the human head region and the remainder of the human body.
For example, detecting a human head region of a human body in the image based on the image feature includes: inputting the image into a first neural network, wherein the first neural network is used for extracting a human head area in the image; outputting at least one head region candidate box in the image from the first neural network.
For example, detecting the human head region of the human body in the image based on the image features further includes: outputting, from the first neural network, a score for each of the at least one head region candidate box of the image, the score representing the likelihood that the region is a head region; comparing the score with a preset head score threshold; and determining each head region candidate box whose score is greater than the preset head score threshold as a head region.
For example, the determining, based on the detection result of the human head region, a human body region in the image corresponding to the human head region includes: acquiring relative position parameters of a human head and a human body obtained based on machine learning; and determining at least one human body region candidate frame corresponding to the human head region in the image according to the relative position parameter.
For example, the determining, based on the detection result of the human head region, a human body region in the image corresponding to the human head region further includes: acquiring a preset estimation parameter, wherein the preset estimation parameter represents the number of at least one human body region candidate frame corresponding to the human head region in the image; and determining the number of human body region candidate frames corresponding to the human head region in the image based on the preset estimation parameters.
For example, the determining, based on the detection result of the human head region, a human body region in the image corresponding to the human head region further includes: determining a score value for each human body region candidate box, the score value representing a likelihood that the extracted human body region candidate box is a human body region; selecting at least one of the number of human body region candidate boxes as the human body region based on the score value.
For example, determining the score value of each human body region candidate box includes: inputting the number of human body region candidate frame images into a trained second neural network, wherein the second neural network is used for determining the score value of each human body region candidate frame image; and outputting the score value corresponding to each human body region candidate box from the second neural network.
For example, the determining, based on the detection result of the human head region, a human body region in the image corresponding to the human head region further includes: and correcting the human body area candidate frame to obtain a corrected human body area.
For example, the modifying the human body region candidate frame includes: inputting the human body region candidate frame image into the third neural network, wherein the third neural network is used for correcting the human body region candidate frame; outputting the correction result of each human body region candidate frame from the third neural network.
For example, outputting the correction result of each of the human body region candidate boxes from the third neural network includes: the third neural network determines an original region of the human body region candidate box in the image based on the human body region candidate box; determining whether the original region is complete; and when the original area is incomplete, correcting the human body area candidate frame corresponding to the original area.
For example, when the original region is incomplete, the correcting the human body region candidate frame corresponding to the original region includes: acquiring a plurality of standard body frames output by the trained third neural network; matching the human body region candidate frame with the plurality of standard body frames; taking the standard body frame with the highest matching rate as a correction frame corresponding to the human body region candidate frame; and correcting the corresponding human body region candidate frame based on the correction frame.
For example, the modified parameters include: the position of the center point of the region, the width of the region and the height of the region.
For example, the processor further performs: non-maximum suppression post-processing on the human head region and the human body region corresponding to the human head region in the image, so as to obtain the human body detected in the image.
For example, performing non-maximum suppression post-processing on the human head region and a human body region corresponding to the human head region in the image to obtain the human body detected in the image includes: determining the score value of each of a plurality of human body regions corresponding to the human head region when there are a plurality of human body regions corresponding to any one of the human head regions; and determining the human body area with the highest score value as the human body area corresponding to the human head area.
For example, the plurality of head regions include a first head region and a second head region, and the step of performing non-maximum suppression post-processing on the head region and a body region corresponding to the head region in the image to obtain the body detected in the image further includes: when the first head area and the second head area are overlapped, determining the ratio of the overlapped area to the union area of the first head area and the second head area; when the ratio is larger than a first ratio threshold, acquiring a first human body area corresponding to the first human head area and a second human body area corresponding to the second human head area; and comparing the score values of the first human body region and the second human body region, and when the score value of the first human body region is larger than the score value of the second human body region, determining the first human body region and a first human head region corresponding to the first human body region as the finally detected human body region and human head region.
There is also provided, in accordance with at least one embodiment of the present disclosure, an apparatus for neural-network-based image detection, the apparatus including: an acquisition unit configured to perform feature extraction on the image to obtain image features; a detection unit configured to detect a human head region of a human body in the image based on the image features; and a determination unit configured to determine a human body region corresponding to the human head region in the image based on a detection result of the human head region, the human body region comprising both the human head region and the remainder of the human body.
There is also provided, in accordance with at least one embodiment of the present disclosure, a non-transitory computer-readable storage medium having program instructions stored therein, the program instructions being loaded by a processor of a computer to perform the steps of the method of the above-described embodiments.
According to the embodiments of the present disclosure, the head region is first detected in the image to be detected, and the pedestrian is then detected near the position of the head region on that basis; compared with conventional detection devices, the detection speed and detection efficiency are improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below. The drawings in the following description are merely exemplary embodiments of the invention.
FIG. 1 shows a flow diagram of an image detection method according to an embodiment of the invention;
fig. 2 illustrates a method of correcting a human body region candidate box according to an embodiment of the present invention;
FIG. 3 shows a flow chart of a human detection method according to an embodiment of the invention;
FIG. 4 shows an architecture diagram of an image detection device according to an embodiment of the invention;
FIG. 5 illustrates an image detection apparatus according to an embodiment of the present disclosure;
FIG. 6 shows an example of a convolution kernel according to an embodiment of the present invention.
Detailed Description
Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. Note that in the present specification and the drawings, steps and elements having substantially the same structure are denoted by the same reference numerals, and repeated explanation of the steps and elements will be omitted.
FIG. 1 illustrates a neural network-based image detection method 100, according to an embodiment of the invention. Referring to fig. 1, the image detection method 100 may include the following steps.
In step S101, feature extraction is performed on the image to obtain image features. According to an example of the present invention, the trained neural network may be used to perform feature extraction on the image, and methods such as SIFT (Scale-invariant feature transform), HOG (Histogram of oriented gradients), and the like may also be adopted.
A Convolutional Neural Network (CNN) is a locally connected network. Compared with a fully connected network, its main characteristics are local connectivity and weight sharing. For a given pixel p in an image, pixels closer to p generally influence it more strongly (local connectivity); and, according to the statistical characteristics of natural images, the weights learned for one region of an image may also be used for another region (weight sharing). Weight sharing can be understood as convolution-kernel sharing: one convolution kernel is convolved with a given image to extract one kind of image feature, and different convolution kernels extract different image features. For example, the output of a convolutional layer can be computed according to the following formula:

convMat = σ(imgMat ⊛ W + b)    (1)

where "σ" represents the activation function; "imgMat" represents the gray-scale image matrix; "W" represents the convolution kernel; "⊛" denotes the convolution operation; and "b" represents a bias value.
According to one example of the present invention, feature extraction may be performed on an image by CNN. FIG. 6 shows an example of a convolution kernel according to an embodiment of the present invention. The convolution kernel may adopt the first convolution kernel shown in fig. 6, where Gx denotes the horizontal direction and Gy denotes the vertical direction.
The image is first convolved with the Gx component of the first convolution kernel, for example using equation (1). Here the convolution kernel size may be 3×3 and the image size may be 512×512. If the convolution is performed directly, without any other processing (such as padding) of the image, the convolved image size is (512−3+1)×(512−3+1). A bias value b is then added to each element of the matrix obtained after the convolution, and each element of the resulting matrix is input to an activation function, giving an image feature extraction result. The activation function may be, for example, the sigmoid function:

σ(x) = 1 / (1 + e^(−x))
In addition, feature extraction can also be performed with the Gy component of the first convolution kernel, giving another image feature extraction result. In this example, two convolution kernels are used, each extracting a different image feature. Those skilled in the art will appreciate that tens or even hundreds of convolution kernels may also be used to extract image features.
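The convolutional feature extraction of equation (1) can be sketched as follows. This is an illustrative sketch only: the Sobel-style kernel stands in for the Gx kernel of Fig. 6 (which is not reproduced in the text), and the sigmoid activation and helper names are assumptions.

```python
import numpy as np

def sigmoid(x):
    """Example activation function: sigma(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def conv_feature(img, kernel, b=0.0):
    """One convolutional-layer output per equation (1):
    out = sigma(img (*) W + b), using a 'valid' convolution, so a
    512x512 image and a 3x3 kernel give a 510x510 result."""
    kh, kw = kernel.shape
    h, w = img.shape
    k = kernel[::-1, ::-1]  # flip: true convolution, not cross-correlation
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return sigmoid(out + b)

# A horizontal-gradient (Sobel) kernel assumed as a stand-in for Gx.
Gx = np.array([[-1.0, 0.0, 1.0], [-2.0, 0.0, 2.0], [-1.0, 0.0, 1.0]])
features = conv_feature(np.random.rand(32, 32), Gx)
print(features.shape)  # (30, 30), i.e. (32-3+1) x (32-3+1)
```

Applying a second kernel (e.g. a vertical-gradient Gy) to the same image yields a second feature map, matching the two-kernel example above.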
According to another example of the present invention, the image may be further subjected to Feature extraction by using a Scale Invariant Feature Transform (SIFT) algorithm. The SIFT algorithm has invariance to translation, rotation and scale change, and has good robustness to noise, view angle change, illumination change and the like.
SIFT feature extraction is performed on a selected image region, mainly to improve computational efficiency; the selected region cannot be too small, otherwise too few feature points are detected and matching accuracy suffers.
The method for SIFT feature extraction may include:
(1) detecting a scale space extreme value:
convolving the input image with Gaussian functions of different kernel scales to obtain the corresponding Gaussian images, where the two-dimensional Gaussian function is defined as:

G(x, y, σ) = (1 / (2πσ²)) · e^(−(x² + y²) / (2σ²))

where σ is the scale parameter (standard deviation) of the Gaussian function, and x and y are the row and column coordinates of the image, respectively.
Subtracting Gaussian images whose scales differ by a factor k forms the DoG (Difference of Gaussian) scale space of the image, as represented by the following formula:
D(x,y,σ)=(G(x,y,kσ)-G(x,y,σ))*I(x,y)=L(x,y,kσ)-L(x,y,σ);
Then, 3 adjacent scales of the DoG scale space are taken, and each pixel in the middle layer is compared one by one with its neighboring pixels in the same layer and at the corresponding positions in the layers above and below; if the point is a local maximum or minimum, it becomes a candidate feature point at that scale.
(2) Positioning the characteristic points:
Since the DoG values are sensitive to noise and edges, the position and scale of each candidate feature point are accurately determined by fitting a Taylor expansion at the local extremum, and low-contrast feature points are removed at the same time.
(3) Determining the main direction of the feature points:
The main direction of each feature point is determined mainly for feature point matching: once the main direction is found, the image can be rotated to that direction during matching, which guarantees rotation invariance. The gradient magnitude and direction at pixel (x, y) are:

m(x, y) = √( (L(x+1, y) − L(x−1, y))² + (L(x, y+1) − L(x, y−1))² )

θ(x, y) = arctan( (L(x, y+1) − L(x, y−1)) / (L(x+1, y) − L(x−1, y)) )

where m(x, y) represents the gradient magnitude and θ(x, y) represents the gradient direction.
Sampling is performed in a neighborhood window centered on the feature point, and the gradient directions of the neighborhood pixels are accumulated into a gradient-direction histogram; the direction corresponding to the highest peak of the histogram is the main direction. At this point, feature point detection is complete, and each feature point carries three pieces of information: location, corresponding scale, and orientation.
(4) Generating SIFT feature descriptors:
The SIFT algorithm generates the feature descriptor by sampling a region. To guarantee rotation invariance, the coordinate axes are first rotated to the direction of the feature point. An 8 × 8 window is taken centered on the feature point, gradient-direction histograms with 8 directions are computed on each 4 × 4 image sub-block, and the accumulated value of each gradient direction forms a seed point. Each feature point is then described by 16 seed points, each carrying 8 direction-vector components, so each feature point generates 16 × 8 = 128 values in total, forming a 128-dimensional SIFT feature descriptor.
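Step (3) above — gradient magnitude, gradient direction, and the histogram-based main direction — can be sketched in NumPy. This is a simplified sketch of that single step, not the full SIFT pipeline; the function names and the 36-bin histogram are illustrative assumptions.

```python
import numpy as np

def gradient_mag_dir(L):
    """m(x, y) and theta(x, y) of a Gaussian-smoothed image L, per the
    finite-difference formulas of step (3); borders are trimmed."""
    dx = L[1:-1, 2:] - L[1:-1, :-2]   # L(x+1, y) - L(x-1, y)
    dy = L[2:, 1:-1] - L[:-2, 1:-1]   # L(x, y+1) - L(x, y-1)
    m = np.sqrt(dx ** 2 + dy ** 2)
    theta = np.arctan2(dy, dx)        # full-quadrant arctangent
    return m, theta

def main_direction(m, theta, bins=36):
    """Dominant direction: the peak of a magnitude-weighted
    orientation histogram over the neighborhood."""
    hist, edges = np.histogram(theta, bins=bins,
                               range=(-np.pi, np.pi), weights=m)
    i = int(np.argmax(hist))
    return 0.5 * (edges[i] + edges[i + 1])  # bin center

# A horizontal intensity ramp has gradients pointing along +x,
# so its main direction is approximately zero.
L = np.tile(np.arange(16.0), (16, 1))
m, theta = gradient_mag_dir(L)
print(abs(main_direction(m, theta)) < 0.2)  # True
```

In full SIFT, the histogram is computed in a window around each feature point rather than over the whole image, but the arithmetic is the same.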
According to an example of the present invention, the image may also be subjected to feature extraction using a HOG (Histogram of oriented gradients). The HOG feature extraction method may include the following steps.
1) Normalizing the image
The purposes of the normalization operation are to improve the robustness of the image feature descriptor to illumination and environmental changes, to reduce local shadows, local over-exposure, and texture distortion in the image, and to suppress interference noise as much as possible. Normalization is implemented by converting the image into a gray-scale image and then applying Gamma correction.
2) Segmenting images
Because the histogram of oriented gradients (HOG) is a local feature descriptor that describes the local texture information of an image, extracting features directly over a large image does not give good results. Therefore the image is divided into smaller grid cells; for example, the image may be divided into cells of 20 × 20 pixels, 2 × 2 cells form one block, and all blocks together cover the image.
3) Calculating the directional gradient histogram of each grid Cell
After the image is divided into small cells, the histogram of oriented gradients of each cell is computed. The gradient images in the X direction and the Y direction of each small region are obtained, and the gradient direction and gradient magnitude of each pixel in each cell are then calculated. From these, a histogram of oriented gradients is generated, with gradient direction on the horizontal axis and accumulated gradient magnitude on the vertical axis.
4) Feature vector normalization
In order to overcome the variation of uneven illumination and the contrast difference between the foreground and background, the feature vector calculated in each small area needs to be normalized. For example, the normalization process is performed using a normalization function.
5) Generation of HOG feature vectors
First, the HOG feature vectors of the small grid cells in the image are combined into the HOG feature vector of a larger block; for example, a block may be formed from 2 × 2 grid cells. Then the HOG feature vectors of all blocks are combined into the HOG feature vector of the whole image. The feature vectors are combined end to end, concatenating the small feature vectors into one feature vector of higher dimension. For example, if an image is divided into m × n blocks and the feature vector of each block has 9 dimensions (one per gradient direction), the dimension of the final feature vector of the image is m × n × 9.
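Steps 2) and 3) above — cell division and per-cell orientation histograms — can be sketched as follows. Block grouping and normalization (steps 4 and 5) are omitted; the unsigned 0–180° gradient convention and the helper names are assumptions.

```python
import numpy as np

def hog_cell_histograms(img, cell=20, bins=9):
    """Per-cell histograms of oriented gradients. Each cell is
    cell x cell pixels; each histogram has `bins` orientation bins
    covering 0-180 degrees (unsigned gradient)."""
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]   # X-direction gradient
    gy[1:-1, :] = img[2:, :] - img[:-2, :]   # Y-direction gradient
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180
    h, w = img.shape
    ch, cw = h // cell, w // cell
    hist = np.zeros((ch, cw, bins))
    bin_width = 180 / bins
    for i in range(ch):
        for j in range(cw):
            a = ang[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell]
            m = mag[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell]
            idx = np.minimum((a // bin_width).astype(int), bins - 1)
            for b in range(bins):
                # accumulate gradient magnitude into the direction bin
                hist[i, j, b] = m[idx == b].sum()
    return hist

hist = hog_cell_histograms(np.random.rand(80, 80))
print(hist.shape)  # (4, 4, 9): 4x4 cells of 20x20 pixels, 9 bins each
```

Concatenating these per-cell histograms block by block, after normalization, would yield the m × n × 9-dimensional vector described above.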
Those skilled in the art will appreciate that the above feature extraction methods are merely exemplary of the present invention, that there are many ways to extract features, and that other image feature extractions may be used.
In step S102, a human head region of a human body in an image is detected based on image features.
According to an example of the present invention, in order to detect a human head region of a human body in an image, a trained first neural network capable of extracting the human head region in the image may be used. Inputting an image to be detected into a first neural network, and then outputting at least one head region candidate frame in the image from the first neural network. The candidate frame of the human head area is an area which is possibly a human head area in the image, and in the subsequent steps, the candidate frame is further screened to determine the human head area.
According to an example of the present invention, in order to determine the possibility that the output human head region candidate frame is a human head region, a score of each of at least one human head region candidate frame of the image may be output from the first neural network while the human head region candidate frame is output from the first neural network, the score indicating the possibility that the region is a human head region. The score is then compared to a preset head score threshold. The preset head score threshold value can be obtained through machine learning when the first neural network is trained, and can also be set according to actual conditions. And if the score is larger than a preset head score threshold value, determining the head area candidate box as a head area. And if the score is smaller than a preset head score threshold value, determining the head area candidate box as not being the head area. In this way, significant non-human head regions can be filtered out.
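The score-threshold filtering described above can be sketched as follows. The box representation, the function name and the example threshold value are illustrative assumptions; as the text notes, the actual threshold may be learned in training or set manually.

```python
def filter_head_candidates(boxes, scores, head_score_threshold=0.5):
    """Keep only head region candidate boxes whose score exceeds the
    preset head score threshold, filtering out obvious non-head regions."""
    return [(b, s) for b, s in zip(boxes, scores) if s > head_score_threshold]
```

For example, with scores 0.9 and 0.2 against a threshold of 0.5, only the first candidate survives.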
In step S103, a human body region corresponding to the human head region in the image is determined based on the detection result of the human head region, the human body region including the human head region and the human body region.
According to an example of the present invention, on the basis of the human head region candidate frame obtained in step S102, in order to further detect a human body region corresponding to the human head region from the human head region, a relative position parameter of the human head and the human body obtained based on machine learning may be obtained. The relative position parameter indicates the proportional relation between the human head and the human body, and then at least one human body region candidate frame corresponding to the human head region in the image is determined according to the relative position parameter. The human body region candidate box is also a region box that may be a human body region. In another example, the relative position parameter may also be set manually.
Since a plurality of human body regions may be detected for one head region in the process of detecting the human body region, in order to obtain an appropriate number of human body region candidate frames, according to an example of the present invention, a preset estimation parameter may be first obtained, where the preset estimation parameter indicates the number of at least one human body region candidate frame corresponding to the head region in the image. And then determining the number of human body region candidate frames corresponding to the human head region in the image based on the preset estimation parameters.
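The two steps above — deriving body candidate frames from a head box via relative position parameters, and limiting their number with a preset estimation parameter — can be sketched as follows. The specific parameterization (width/height scale factors relative to the head, body extending downward from the head top, shared center x) is an assumption for illustration; the patent only states that the parameters encode the head-to-body proportional relation.

```python
def body_candidates_from_head(head_box, ratios, num_candidates=None):
    """Derive body region candidate frames from a head box.
    head_box: (x0, y0, x1, y1) with y increasing downward.
    ratios: list of (w_scale, h_scale) relative position parameters,
    e.g. learned or manually set as described in the text."""
    x0, y0, x1, y1 = head_box
    hw, hh = x1 - x0, y1 - y0
    cx = (x0 + x1) / 2.0
    candidates = []
    for (w_scale, h_scale) in ratios:
        bw, bh = hw * w_scale, hh * h_scale
        # body shares the head's center x and extends downward
        candidates.append((cx - bw / 2.0, y0, cx + bw / 2.0, y0 + bh))
    if num_candidates is not None:  # preset estimation parameter
        candidates = candidates[:num_candidates]
    return candidates
```
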
After the plurality of human body region candidate boxes are acquired, in order to filter out candidate boxes that are obviously not human body regions from the plurality of human body region candidate boxes, according to an example of the present invention, a score value of each human body region candidate box, which represents a possibility that the extracted human body region candidate box is a human body region, may also be determined. Then, based on the score value, at least one of the plurality of detected human body region candidate boxes may be selected as a human body region.
For example, a trained second neural network may be used, the second neural network being used to determine a score value for each human region candidate box image. And inputting the human body region candidate boxes output in the front into a trained second neural network, and outputting a score value corresponding to each human body region candidate box from the second neural network. The score value is used to determine the likelihood that the human body region candidate box is a human body region, thereby performing efficient screening.
In addition, in order to improve the accuracy of pedestrian detection, according to an example of the present invention, after the second neural network outputs the human body region candidate frame, the human body region candidate frame may be further corrected to obtain a corrected human body region.
According to an example of the present invention, the human region candidate box is modified using a trained third neural network. For example, the human body region candidate box is input into the third neural network, and then the correction result of each human body region candidate box is output from the third neural network.
Fig. 2 shows a method 200 for modifying a human body region candidate box according to an embodiment of the invention. Referring to fig. 2, the human body region candidate box correcting method may include the following steps.
In step S201, the third neural network determines an original region of the human body region candidate frame in the image based on the human body region candidate frame. For example, when training the third neural network, the image subjected to feature extraction in the previous step S101 may be used as an original image, i.e., an image to be detected. The original region is the region of the candidate frame of the human body region in the image to be detected. As an example, the original region may be cut out from the original image according to the position of the human body region candidate frame in the original image.
In step S202, it is determined whether the original area is complete. For example, whether the human body or the human head in the original region is intact can be judged according to the human body scale, the normal scale range and the size range of the human body part.
In step S203, when the original region is incomplete, the human body region candidate frame corresponding to the original region is corrected. As an example, when performing the correction, a plurality of standard body frames output by the trained third neural network are first acquired. The standard body frame is obtained according to the labeling data of a plurality of samples of normal human bodies. Thus, there may be a plurality of standard body frames, for example, different standard body frames for persons of different ages and sexes. The human region candidate box is then matched against a plurality of standard body boxes. And taking the standard body frame with the highest matching rate as a correction frame corresponding to the human body region candidate frame. That is, the standard body frame with the highest matching rate is closest to the human body region candidate frame, and the human body region candidate frame may be corrected with reference to the standard body frame with the highest matching rate. Parameters that need to be modified may include, for example: and at least one of the position of the center point of the human body region candidate frame, the width of the region and the height of the region.
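The matching step above can be sketched as follows. The patent does not fix how the "matching rate" between a candidate frame and the standard body frames is computed, so aspect-ratio closeness is used here purely as a hypothetical criterion, and the function name is an assumption.

```python
def best_standard_frame(candidate, standard_frames):
    """Pick the standard body frame with the highest matching rate to the
    candidate frame. Matching here compares width/height aspect ratios --
    a hypothetical criterion chosen for illustration only."""
    def aspect(box):  # box: (x0, y0, x1, y1)
        return (box[2] - box[0]) / (box[3] - box[1])
    return min(standard_frames,
               key=lambda s: abs(aspect(s) - aspect(candidate)))
```

The returned standard body frame would then serve as the correction frame referenced when adjusting the candidate's center point, width and height.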
As an example, the parameter correction may employ the following method. Assume the standard body frame is a rectangular frame, where x is the abscissa of the center point of the standard body frame, y is the ordinate of the center point, w is the width, and h is the height. Let (x0, y0) be the coordinates of the lower-left corner of the standard body frame and (x1, y1) the coordinates of the upper-right corner. Then:

x = (x0 + x1)/2, y = (y0 + y1)/2, w = x1 − x0, h = y1 − y0.
Suppose the human body region candidate frame is also a rectangular frame, where xa is the abscissa of the center point of the human body region candidate frame, ya is the ordinate of the center point, wa is the width, and ha is the height of the human body region candidate frame.
Assuming that the corrected human body region candidate frame is also a rectangle, x 'is an abscissa of a center point of the corrected human body region candidate frame, y' is an ordinate of the center point of the corrected human body region candidate frame, w 'is a width of the corrected human body region candidate frame, and h' is a height of the corrected human body region candidate frame.
The regression targets of the correction offsets tx, ty, tw, th can be obtained according to the following formulas:

tx = (x − xa)/wa, ty = (y − ya)/ha,

tw = log(w/wa), th = log(h/ha),

t′x = (x′ − xa)/wa, t′y = (y′ − ya)/ha,

t′w = log(w′/wa), t′h = log(h′/ha).
In training the third neural network, t′x, t′y, t′w and t′h are predicted according to the image features, the human body region candidate frame and the parameters of the standard body frame, so that Σi (ti − t′i)² is as small as possible, where i ∈ {x, y, w, h}. When Σi (ti − t′i)² converges, training of the third neural network is completed, and then the t′x, t′y, t′w, t′h output by the trained neural network are used to calculate the center (x′, y′), width w′ and height h′ of the corrected human body region candidate frame:
w′ = wa·exp(t′w), h′ = ha·exp(t′h),

x′ = t′x·wa + xa, y′ = t′y·ha + ya.
The coordinates of the lower-left corner of the corrected human body region candidate frame are (x′0, y′0), and the coordinates of the upper-right corner are (x′1, y′1). These coordinate values are then calculated using the following formulas:
x′0 = x′ − w′/2

x′1 = x′ + w′/2

y′0 = y′ − h′/2

y′1 = y′ + h′/2
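The offset formulas above can be exercised with a small sketch: encoding a target box against a candidate (anchor) box yields the regression targets tx, ty, tw, th, and decoding inverts them exactly per the formulas for x′, y′, w′, h′. In practice the decoded offsets would come from the third neural network; here they are computed directly to show the round trip.

```python
import math

def encode_offsets(anchor, target):
    """Regression targets (tx, ty, tw, th) of a target box relative to a
    candidate box; both boxes are (center x, center y, width, height)."""
    xa, ya, wa, ha = anchor
    x, y, w, h = target
    return ((x - xa) / wa, (y - ya) / ha,
            math.log(w / wa), math.log(h / ha))

def decode_offsets(anchor, t):
    """Corrected box (x', y', w', h') from predicted offsets:
    x' = t'x * wa + xa, y' = t'y * ha + ya,
    w' = wa * exp(t'w),  h' = ha * exp(t'h)."""
    xa, ya, wa, ha = anchor
    tx, ty, tw, th = t
    return (tx * wa + xa, ty * ha + ya,
            wa * math.exp(tw), ha * math.exp(th))

def center_to_corners(box):
    """(x', y', w', h') -> lower-left (x'0, y'0), upper-right (x'1, y'1)."""
    x, y, w, h = box
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)
```
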
according to the embodiment of the invention, the output human body region candidate frame is corrected, so that the identification degree of the human body region candidate frame can be effectively improved, and the detection accuracy is improved.
Fig. 3 shows a flow chart 300 of a human body detection method according to an embodiment of the invention, and referring to fig. 3, the human body detection method may include the following steps.
In step S301, feature extraction is performed on the image to obtain image features.
In step S302, a human head region of a human body in an image is detected based on image features;
in step S303, a human body region corresponding to the human head region in the image is determined based on the detection result of the human head region, the human body region including the human head region and the human body region.
In step S304, non-maximum suppression post-processing is performed on the human head region and the human body region corresponding to the human head region in the image to obtain a human body detected in the image.
Steps S301 to S303 are respectively the same as steps S101 to S103 in the foregoing embodiment, and are not repeated herein, specifically referring to the foregoing embodiment.
In step S304, as an example, when a plurality of human head regions are detected in step S302, and a plurality of human body regions corresponding to one human head region are determined in step S303, a score value of each of the plurality of human body regions corresponding to the human head region may be further determined. For example, as previously described, a trained second neural network is employed to determine a score value for each human body region. And then determining the human body area with the highest score value as the human body area corresponding to the human head area.
In the pedestrian detection process, two human head regions overlapped with each other are likely to be detected, and in order to reduce the false alarm rate, the following processing may be performed. It is assumed that the detected head regions are at least two, including a first head region and a second head region. When the first head region and the second head region overlap, determining the ratio of the overlapping region to the union region of the first head region and the second head region. And when the ratio is greater than the first ratio threshold, acquiring a first human body region corresponding to the first human head region and a second human body region corresponding to the second human head region. And then comparing the score values of the first human body region and the second human body region, and when the score value of the first human body region is larger than the score value of the second human body region, determining the first human body region and the first human head region corresponding to the first human body region as the finally detected human body region and human head region. That is, when the overlapping human head frame occurs, only the human body region with a high score and the corresponding human head region are retained.
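The overlap handling above can be sketched as follows. The overlap test uses the ratio of the intersection area to the union area of the two head regions, as described; the greedy keep-highest-body-score ordering and the threshold value are assumptions for illustration.

```python
def iou(a, b):
    """Ratio of the overlapping area to the union area of two
    boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def suppress_overlapping_heads(detections, ratio_threshold=0.5):
    """detections: list of (head_box, body_box, body_score).
    When two head boxes overlap above the threshold, keep only the
    detection whose body region has the higher score."""
    kept = []
    for det in sorted(detections, key=lambda d: -d[2]):
        if all(iou(det[0], k[0]) <= ratio_threshold for k in kept):
            kept.append(det)
    return kept
```
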
The image detection method of the embodiment of the invention firstly detects the human head area in the image to be detected, and then detects the pedestrian near the position of the human head area on the basis of the human head area, thereby improving the accuracy of pedestrian detection, reducing the area needing to be detected, and improving the detection speed to a certain extent compared with the traditional detection method. Meanwhile, the method can provide corresponding information of the positions of the person and the head, and provides more information for various subsequent requirements.
Fig. 4 shows an architecture diagram of an image inspection apparatus 400 according to an embodiment of the present invention. The image detection apparatus 400 corresponds to the image detection method of the foregoing embodiment, and for brevity of the description, only brief description will be made herein, and specific reference will be made to the description of the foregoing embodiment.
Referring to fig. 4, the image detection apparatus 400 includes: a memory 401 and a processor 402, the memory 401 storing program instructions, and the processor 402, when executing the program instructions, performing: performing feature extraction on the image to obtain image features; detecting a human head region of a human body in the image based on the image features; and determining a human body region corresponding to the human head region in the image based on the detection result of the human head region, wherein the human body region comprises the human head region and the body region.
For example, detecting a human head region of a human body in an image based on image features includes: inputting the image into a first neural network, wherein the first neural network is used for extracting a human head area in the image; at least one head region candidate box in the image is output from the first neural network.
For example, detecting the human head region of the human body in the image based on the image features further includes: outputting a score for each of at least one head region candidate box of the image from the first neural network, the score representing a likelihood that the region is a head region; comparing the score with a preset head score threshold; and determining the head area with the score larger than a preset head score threshold value as the head area.
For example, determining a human body region in the image corresponding to the human head region based on the detection result of the human head region includes: acquiring relative position parameters of a human head and a human body obtained based on machine learning; and determining at least one human body region candidate frame corresponding to the human head region in the image according to the relative position parameters.
For example, determining a human body region in the image corresponding to the human head region based on the detection result of the human head region further includes: acquiring a preset estimation parameter, wherein the preset estimation parameter represents the number of at least one human body region candidate frame corresponding to a human head region in an image; and determining the number of human body region candidate frames corresponding to the human head region in the image based on the preset estimation parameters.
For example, determining a human body region in the image corresponding to the human head region based on the detection result of the human head region further includes: determining a score value of each human body region candidate box, the score value representing a likelihood that the extracted human body region candidate box is a human body region; at least one of the number of human body region candidate boxes is selected as a human body region based on the score value.
For example, determining the score value for each human region candidate box includes: inputting the number of the human body region candidate frame images into a trained second neural network, wherein the second neural network is used for determining the fraction value of each human body region candidate frame image; and outputting the score value corresponding to each human body region candidate box from the second neural network.
For example, determining a human body region in the image corresponding to the human head region based on the detection result of the human head region further includes: and correcting the human body area candidate frame to obtain a corrected human body area.
For example, modifying the human body region candidate frame includes: inputting the human body region candidate frame image into a third neural network, wherein the third neural network is used for correcting the human body region candidate frame; and outputting the correction result of each human body region candidate frame from the third neural network.
For example, outputting the correction result of each human body region candidate box from the third neural network includes: the third neural network determines an original region of the human body region candidate frame in the image based on the human body region candidate frame;
determining whether the original area is complete; and when the original area is incomplete, correcting the human body area candidate frame corresponding to the original area.
For example, when the original region is incomplete, correcting the human body region candidate frame corresponding to the original region includes: acquiring a plurality of standard body frames output by the trained third neural network; matching the human body region candidate frame with a plurality of standard body frames; taking the standard body frame with the highest matching rate as a correction frame corresponding to the human body region candidate frame; and correcting the corresponding human body region candidate frame based on the correction frame.
For example, the modified parameters include at least one of a location of a center point of the region, a width of the region, and a height of the region.
For example, the processor, when executing the program instructions, further performs: carrying out non-maximum suppression post-processing on the human head region and the human body region corresponding to the human head region in the image so as to obtain the human body detected in the image.
For example, performing non-maximum suppression post-processing on a human head region and a human body region corresponding to the human head region in the image to obtain a human body detected in the image includes: when any one of the human body regions corresponds to a plurality of human body regions, determining a score value of each of the plurality of human body regions corresponding to the human head region; and determining the human body area with the highest score value as the human body area corresponding to the human head area.
For example, when there are a plurality of head regions, the plurality of head regions include a first head region and a second head region, and the step of performing non-maximum suppression post-processing on the head region and a body region corresponding to the head region in the image to obtain the body detected in the image further includes: when the first head area and the second head area are overlapped, determining the ratio of the overlapped area to the union area of the first head area and the second head area; when the ratio is larger than a first ratio threshold, acquiring a first human body area corresponding to the first human head area and a second human body area corresponding to the second human head area; and comparing the score values of the first human body area and the second human body area, and when the score value of the first human body area is larger than the score value of the second human body area, determining the first human body area and the first human head area corresponding to the first human body area as the finally detected human body area and human head area.
The image detection device provided by the embodiment of the invention detects the head area in the image from the image to be detected, and then detects the pedestrian near the position of the head area on the basis of the head area. Meanwhile, the device can provide corresponding information of the positions of the person and the head, and provides more information for various subsequent requirements.
Further, according to at least one embodiment of the present disclosure, there is also provided a computer-executable nonvolatile storage medium, which corresponds to the memory 401 in the image detection apparatus 400 in the foregoing embodiment, and for brevity of the description, only brief description will be made below, with specific reference to the description of the foregoing embodiment. The non-volatile storage medium stores therein program instructions that are loaded by a processor of the computer and that perform the steps of the method described in the above embodiments.
In addition, according to at least one embodiment of the present disclosure, there is also provided an image detection apparatus based on a neural network, which corresponds to the method of the foregoing embodiment, and for brevity of the description, only brief description is made below. Fig. 5 illustrates an image detection apparatus 500 according to an embodiment of the present disclosure. Referring to fig. 5, the image detection apparatus 500 includes: an acquisition unit 501, a detection unit 502, and a determination unit 503. For example, the acquisition unit 501 is configured to perform feature extraction on an image to obtain image features. The detection unit 502 is configured to detect a human head region of a human body in an image based on image features. The determination unit 503 is configured to determine a human body region corresponding to the human head region in the image based on the detection result of the human head region, the human body region including the human head region and the body region.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. And the software modules may be disposed in any form of computer storage media. To clearly illustrate this interchangeability of hardware and software, various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It should be understood by those skilled in the art that various modifications, combinations, partial combinations and substitutions may be made in the present invention depending on design requirements and other factors as long as they are within the scope of the appended claims and their equivalents.

Claims (26)

1. A neural network-based image detection method, the method comprising:
performing feature extraction on the image to obtain image features;
detecting a human head region of a human body in the image based on the image features;
determining a human body region corresponding to the human head region in the image based on the detection result of the human head region, the human body region including the human head region and a body region,
wherein the method further comprises:
performing non-maximum suppression post-processing on the human head region and a human body region corresponding to the human head region in the image to obtain a human body detected in the image,
the step of performing non-maximum suppression post-processing on the human head region and a human body region corresponding to the human head region in the image to obtain a human body detected in the image includes:
when any one of the human body regions corresponds to a plurality of human body regions, determining a score value of each of the plurality of human body regions corresponding to the human head region;
determining the human body area with the highest score value as the human body area corresponding to the human head area,
wherein, the human head region is a plurality of, a plurality of human head regions include first human head region and second human head region, to in the image the human head region with the human body region that the human head region corresponds carries out non-maximum value and suppresses the post-processing, in order to obtain the human step of detecting in the image still includes:
when the first head area and the second head area are overlapped, determining the ratio of the overlapped area to the union area of the first head area and the second head area;
when the ratio is larger than a first ratio threshold, acquiring a first human body area corresponding to the first human head area and a second human body area corresponding to the second human head area;
and comparing the score values of the first human body region and the second human body region, and when the score value of the first human body region is larger than the score value of the second human body region, determining the first human body region and a first human head region corresponding to the first human body region as the finally detected human body region and human head region.
2. The method of claim 1, wherein detecting a human head region of a human body in the image based on the image feature comprises:
inputting the image into a first neural network, wherein the first neural network is used for extracting a human head area in the image;
outputting at least one head region candidate box in the image from the first neural network.
3. The method of claim 2, wherein detecting a human head region of a human body in the image based on the image feature further comprises:
outputting a score for each of at least one head region candidate box of the image from the first neural network, the score representing a likelihood that the region is a head region;
comparing the score with a preset head score threshold;
and determining the head area with the score larger than the preset head score threshold value as the head area.
4. The method according to claim 1, wherein the step of determining a human body region in the image corresponding to the human head region based on the detection result of the human head region comprises:
acquiring relative position parameters of a human head and a human body obtained based on machine learning;
and determining at least one human body region candidate frame corresponding to the human head region in the image according to the relative position parameter.
5. The method of claim 4, wherein the step of determining a human body region in the image corresponding to the human head region based on the detection result of the human head region further comprises:
acquiring a preset estimation parameter, wherein the preset estimation parameter represents the number of at least one human body region candidate frame corresponding to the human head region in the image;
and determining the number of human body region candidate frames corresponding to the human head region in the image based on the preset estimation parameters.
6. The method of claim 5, wherein the step of determining a human body region in the image corresponding to the human head region based on the detection result of the human head region further comprises:
determining a score value for each human body region candidate box, the score value representing a likelihood that the extracted human body region candidate box is a human body region;
selecting at least one of the number of human body region candidate boxes as the human body region based on the score value.
7. The method of claim 6, wherein the step of determining the score value for each human region candidate box comprises:
inputting the number of human body region candidate frame images into a trained second neural network, wherein the second neural network is used for determining a fraction value of each human body region candidate frame image;
outputting the score value corresponding to each human body region candidate box from the second neural network.
8. The method of claim 4, wherein the step of determining a human body region in the image corresponding to the human head region based on the detection result of the human head region further comprises:
and correcting the human body area candidate frame to obtain a corrected human body area.
9. The method of claim 8, wherein the step of modifying the human region candidate box comprises:
inputting the human body region candidate frame image into a third neural network, wherein the third neural network is used for correcting the human body region candidate frame;
outputting the correction result of each human body region candidate frame from the third neural network.
10. The method of claim 9, wherein outputting the revised result for each of the human region candidate boxes from the third neural network comprises:
the third neural network determines an original region of the human body region candidate box in the image based on the human body region candidate box;
determining whether the original region is complete;
and when the original area is incomplete, correcting the human body area candidate frame corresponding to the original area.
11. The method according to claim 10, wherein when the original region is incomplete, the step of correcting the candidate frame of the human body region corresponding to the original region comprises:
acquiring a plurality of standard body frames output by the trained third neural network;
matching the human body region candidate frame with the plurality of standard body frames;
taking the standard body frame with the highest matching rate as a correction frame corresponding to the human body region candidate frame;
and correcting the corresponding human body region candidate frame based on the correction frame.
12. The method of claim 11, wherein the revised parameters comprise:
the position of the center point of the region, the width of the region and the height of the region.
13. An image detection apparatus based on a neural network, comprising: a memory, a processor, the memory storing program instructions, the processor when processing the program instructions performing:
performing feature extraction on the image to obtain image features;
detecting a human head region of a human body in the image based on the image features;
determining a human body region corresponding to the human head region in the image based on the detection result of the human head region, the human body region including the human head region and a body region,
wherein the processor, when executing the program instructions, further performs:
performing non-maximum suppression post-processing on the human head region and a human body region corresponding to the human head region in the image to obtain a human body detected in the image,
performing non-maximum suppression post-processing on the human head region and a human body region corresponding to the human head region in the image to obtain a human body detected in the image includes:
when any one of the human head regions corresponds to a plurality of human body regions, determining a score value of each of the plurality of human body regions corresponding to the human head region;
determining the human body area with the highest score value as the human body area corresponding to the human head area,
wherein there are a plurality of human head regions, the plurality of human head regions including a first human head region and a second human head region, and performing non-maximum suppression post-processing on the human head region and the human body region corresponding to the human head region in the image to obtain the human body detected in the image further comprises:
when the first head area and the second head area are overlapped, determining the ratio of the overlapped area to the union area of the first head area and the second head area;
when the ratio is larger than a first ratio threshold, acquiring a first human body area corresponding to the first human head area and a second human body area corresponding to the second human head area;
and comparing the score values of the first human body region and the second human body region, and when the score value of the first human body region is larger than the score value of the second human body region, determining the first human body region and a first human head region corresponding to the first human body region as the finally detected human body region and human head region.
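The head-driven non-maximum suppression of claim 13 can be sketched as follows. The detection data layout and the greedy score-ordered loop are illustrative assumptions; the claims specify only the overlap-ratio test on head regions and the score comparison on the corresponding body regions:

```python
# Hedged sketch of the NMS post-processing in claim 13: when two head
# regions overlap with an overlap/union ratio above a threshold, keep only
# the head/body pair whose body region has the higher score value.

def iou(a, b):
    """Ratio of overlap area to union area of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def head_nms(detections, ratio_threshold=0.5):
    """detections: list of dicts {'head': box, 'body': box, 'score': float}.
    Keep a detection only if its head region does not overlap (above the
    threshold) any already-kept, higher-scoring detection's head region."""
    kept = []
    for det in sorted(detections, key=lambda d: d['score'], reverse=True):
        if all(iou(det['head'], k['head']) <= ratio_threshold for k in kept):
            kept.append(det)
    return kept
```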
14. The apparatus of claim 13, wherein detecting a human head region of a human body in the image based on the image feature comprises:
inputting the image into a first neural network, wherein the first neural network is used for extracting a human head area in the image;
outputting at least one head region candidate box in the image from the first neural network.
15. The apparatus of claim 14, wherein detecting a human head region of a human body in the image based on the image feature further comprises:
outputting a score for each of at least one head region candidate box of the image from the first neural network, the score representing a likelihood that the region is a head region;
comparing the score with a preset head score threshold;
and determining a head region candidate box with a score greater than the preset head score threshold as the head region.
16. The apparatus of claim 13, wherein the determining a human body region in the image corresponding to the human head region based on the detection result of the human head region comprises:
acquiring relative position parameters of a human head and a human body obtained based on machine learning;
and determining at least one human body region candidate frame corresponding to the human head region in the image according to the relative position parameter.
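Claim 16 derives a body-region candidate from a detected head region using learned relative position parameters. A minimal sketch under illustrative assumptions: the particular parameterization (body width and height as multiples of the head size, body top aligned with the head top) is not taken from the patent, which leaves the learned parameters unspecified:

```python
# Hypothetical sketch of claim 16: estimate a body-region candidate box
# from a head box using relative position parameters obtained by machine
# learning. The ratio-based parameterization here is an assumption.

def body_from_head(head_box, w_ratio=3.0, h_ratio=7.0):
    """head_box: (x1, y1, x2, y2). Returns a body candidate box centered
    horizontally on the head, extending downward from the head top."""
    x1, y1, x2, y2 = head_box
    head_w, head_h = x2 - x1, y2 - y1
    cx = (x1 + x2) / 2.0
    body_w, body_h = head_w * w_ratio, head_h * h_ratio
    return (cx - body_w / 2.0, y1, cx + body_w / 2.0, y1 + body_h)
```

Several candidate boxes could be generated by varying `w_ratio` and `h_ratio`, matching the "at least one human body region candidate frame" of the claim.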
17. The apparatus of claim 16, wherein the determining a body region in the image corresponding to the head region based on the detection of the head region further comprises:
acquiring a preset estimation parameter, wherein the preset estimation parameter represents the number of at least one human body region candidate frame corresponding to the human head region in the image;
and determining the number of human body region candidate frames corresponding to the human head region in the image based on the preset estimation parameters.
18. The apparatus of claim 17, wherein the determining a body region in the image corresponding to the head region based on the detection of the head region further comprises:
determining a score value for each human body region candidate box, the score value representing a likelihood that the extracted human body region candidate box is a human body region;
selecting at least one of the number of human body region candidate boxes as the human body region based on the score value.
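The selection step in claim 18 reduces to keeping the highest-scoring candidate frames. A minimal sketch; the `(box, score)` tuple layout is an assumption:

```python
# Sketch of claim 18: from the scored body-region candidate boxes, select
# the top-k by score value as the human body region(s).

def select_top_candidates(candidates, k=1):
    """candidates: list of (box, score) pairs; returns the k highest-scoring."""
    return sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
```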
19. The apparatus of claim 18, wherein determining a score value for each human region candidate box comprises:
inputting the number of human body region candidate frame images into a trained second neural network, wherein the second neural network is used for determining a score value of each human body region candidate frame image;
outputting the score value corresponding to each human body region candidate box from the second neural network.
20. The apparatus of claim 16, wherein the determining a body region in the image corresponding to the head region based on the detection of the head region further comprises:
and correcting the human body area candidate frame to obtain a corrected human body area.
21. The apparatus of claim 20, wherein modifying the human region candidate box comprises:
inputting the human body region candidate frame image into a third neural network, wherein the third neural network is used for correcting the human body region candidate frame;
outputting the correction result of each human body region candidate frame from the third neural network.
22. The apparatus of claim 21, wherein outputting the correction result for each of the human body region candidate boxes from the third neural network comprises:
the third neural network determines an original region of the human body region candidate box in the image based on the human body region candidate box;
determining whether the original region is complete;
and when the original area is incomplete, correcting the human body area candidate frame corresponding to the original area.
23. The apparatus of claim 22, wherein when the original region is incomplete, correcting the human body region candidate frame corresponding to the original region comprises:
acquiring a plurality of standard body frames output by the trained third neural network;
matching the human body region candidate frame with the plurality of standard body frames;
taking the standard body frame with the highest matching rate as a correction frame corresponding to the human body region candidate frame;
and correcting the corresponding human body region candidate frame based on the correction frame.
24. The apparatus of claim 23, wherein the corrected parameters comprise:
the position of the center point of the region, the width of the region and the height of the region.
25. An image detection apparatus based on a neural network, the apparatus comprising:
an acquisition unit configured to perform feature extraction on the image to obtain an image feature;
a detection unit configured to detect a human head region of a human body in the image based on the image feature;
a determination unit configured to determine a human body region corresponding to the human head region in the image based on a detection result of the human head region, the human body region including the human head region and a body region,
the determination unit is further configured to perform non-maximum suppression post-processing on the head region and a human body region corresponding to the head region in the image to obtain a human body detected in the image,
wherein the determining unit performs non-maximum suppression post-processing on the head region and a body region corresponding to the head region in the image to obtain the body detected in the image includes:
when any one of the human head regions corresponds to a plurality of human body regions, determining a score value of each of the plurality of human body regions corresponding to the human head region;
determining the human body area with the highest score value as the human body area corresponding to the human head area,
wherein there are a plurality of human head regions, the plurality of human head regions including a first human head region and a second human head region, and the determining unit performing non-maximum suppression post-processing on the human head region and the human body region corresponding to the human head region in the image to obtain the human body detected in the image further includes:
when the first head area and the second head area are overlapped, determining the ratio of the overlapped area to the union area of the first head area and the second head area;
when the ratio is larger than a first ratio threshold, acquiring a first human body area corresponding to the first human head area and a second human body area corresponding to the second human head area;
comparing the score values of the first human body region and the second human body region, and when the score value of the first human body region is greater than the score value of the second human body region, determining the first human body region and the first human head region corresponding to the first human body region as the finally detected human body region and human head region.
26. A non-volatile computer-readable storage medium storing program instructions which, when loaded by a processor of a computer, carry out the steps of the method of any one of claims 1 to 12.
CN201711107369.2A 2017-11-10 2017-11-10 Image detection method and image detection device based on neural network Active CN108875504B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711107369.2A CN108875504B (en) 2017-11-10 2017-11-10 Image detection method and image detection device based on neural network


Publications (2)

Publication Number Publication Date
CN108875504A CN108875504A (en) 2018-11-23
CN108875504B true CN108875504B (en) 2021-07-23

Family

ID=64325768

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711107369.2A Active CN108875504B (en) 2017-11-10 2017-11-10 Image detection method and image detection device based on neural network

Country Status (1)

Country Link
CN (1) CN108875504B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109800859B (en) * 2018-12-25 2021-01-12 深圳云天励飞技术有限公司 Neural network batch normalization optimization method and device
CN109522910B (en) * 2018-12-25 2020-12-11 浙江商汤科技开发有限公司 Key point detection method and device, electronic equipment and storage medium
CN110378227B (en) * 2019-06-17 2021-04-13 北京达佳互联信息技术有限公司 Method, device and equipment for correcting sample labeling data and storage medium
CN110298302B (en) * 2019-06-25 2023-09-08 腾讯科技(深圳)有限公司 Human body target detection method and related equipment
JP7427398B2 (en) * 2019-09-19 2024-02-05 キヤノン株式会社 Image processing device, image processing method, image processing system and program

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2361726C2 (en) * 2007-02-28 2009-07-20 Общество С Ограниченной Ответственностью "Алгоритм-Робо" System of controlling anthropomorphous robot and control method
CN102799888B (en) * 2011-05-27 2015-03-11 株式会社理光 Eye detection method and eye detection equipment
CN106485230B (en) * 2016-10-18 2019-10-25 中国科学院重庆绿色智能技术研究院 Training, method for detecting human face and the system of Face datection model neural network based
CN106650699B (en) * 2016-12-30 2019-09-17 中国科学院深圳先进技术研究院 A kind of method for detecting human face and device based on convolutional neural networks
CN106960195B (en) * 2017-03-27 2020-04-03 深圳市和巨信息技术有限公司 Crowd counting method and device based on deep learning
CN107128492B (en) * 2017-05-05 2019-09-20 成都通甲优博科技有限责任公司 A kind of unmanned plane tracking, device and unmanned plane based on number of people detection


Similar Documents

Publication Publication Date Title
CN108875504B (en) Image detection method and image detection device based on neural network
CN109684924B (en) Face living body detection method and device
CN110414507B (en) License plate recognition method and device, computer equipment and storage medium
CN108875723B (en) Object detection method, device and system and storage medium
CN108986152B (en) Foreign matter detection method and device based on difference image
CN109697441B (en) Target detection method and device and computer equipment
CN110378837B (en) Target detection method and device based on fish-eye camera and storage medium
WO2019071976A1 (en) Panoramic image saliency detection method based on regional growth and eye movement model
Laguna et al. Traffic sign recognition application based on image processing techniques
CN110472521B (en) Pupil positioning calibration method and system
CN111626295B (en) Training method and device for license plate detection model
CN110147708B (en) Image data processing method and related device
WO2018100668A1 (en) Image processing device, image processing method, and image processing program
Han et al. Research and implementation of an improved canny edge detection algorithm
CN108960247B (en) Image significance detection method and device and electronic equipment
CN104268550B (en) Feature extracting method and device
CN112101386A (en) Text detection method and device, computer equipment and storage medium
JP2013037539A (en) Image feature amount extraction device and program thereof
CN114155285B (en) Image registration method based on gray histogram
CN114119437A (en) GMS-based image stitching method for improving moving object distortion
CN108875501B (en) Human body attribute identification method, device, system and storage medium
CN111353325A (en) Key point detection model training method and device
CN110516731B (en) Visual odometer feature point detection method and system based on deep learning
Kim et al. Improving the search accuracy of the VLAD through weighted aggregation of local descriptors
CN110852958B (en) Self-adaptive correction method and device based on object inclination angle

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant