CN111144220B - Personnel detection method, device, equipment and medium suitable for big data


Info

Publication number
CN111144220B
CN111144220B
Authority
CN
China
Prior art keywords
human head
window
frames
image
classifier
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911201697.8A
Other languages
Chinese (zh)
Other versions
CN111144220A (en)
Inventor
钟军
张毅
林欣郁
邹建红
颜阿南
郑宇哲
杨希锐
徐寿坤
庄仁贵
余美书
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujian Nebula Big Data Application Service Co ltd
Original Assignee
Fujian Nebula Big Data Application Service Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujian Nebula Big Data Application Service Co ltd filed Critical Fujian Nebula Big Data Application Service Co ltd
Priority to CN201911201697.8A priority Critical patent/CN111144220B/en
Publication of CN111144220A publication Critical patent/CN111144220A/en
Application granted granted Critical
Publication of CN111144220B publication Critical patent/CN111144220B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G06V 20/41 — Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06F 18/23213 — Non-hierarchical clustering techniques using statistics or function optimisation, with a fixed number of clusters, e.g. K-means clustering
    • G06F 18/2411 — Classification techniques based on the proximity to a decision surface, e.g. support vector machines
    • G06N 3/045 — Neural network architectures: combinations of networks
    • G06N 3/084 — Neural network learning methods: backpropagation, e.g. using gradient descent
    • G06V 10/50 — Extraction of image or video features by performing operations within image blocks; by using histograms, e.g. histogram of oriented gradients [HoG]
    • Y02D 10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention provides a personnel detection method, device, equipment and medium suitable for big data, wherein the method comprises the following steps: S1: taking one frame from every M consecutive frames of the image sequence of a video to obtain a sampled video sequence; S2: taking N consecutive frames from the sampled video sequence each time and inputting them into a pre-classifier to obtain preliminary human head windows to be detected; S3: inputting the preliminary human head windows to be detected into a main classifier and further eliminating windows without human heads to obtain secondary human head windows to be detected; S4: inputting the secondary human head windows to be detected into a cluster analyzer, where the detection results of the frames correct one another, reducing errors and producing a high-precision human head detection result. Based on a multi-stage cascaded human head classifier, the invention realizes head detection in a variety of crowded scenes, fuses static features, dynamic features and inter-frame correction, and solves the technical problem that good head detection is difficult to achieve in actual scenes owing to complex-background interference and insufficient detection precision.

Description

Personnel detection method, device, equipment and medium suitable for big data
Technical Field
The invention relates to the technical field of computers, and in particular to a method, device, equipment and medium for identifying and counting people in video big data.
Background
The development of science, technology and productivity has brought a rapid increase in data volume, in which multimedia data such as video images account for a large share. How to process these massive data efficiently and quickly mine valuable information from them is a current research hotspot. Big data generally has four characteristics: large data volume, fast response, diverse data types and low value density. Video big data shares these characteristics, but is distinguished by greater data redundancy, which requires efficient compression coding and analysis processing. Intelligent video analysis is one of the research topics of video big data, and its development trend is to keep optimizing algorithms and to combine deep convolutional neural networks to obtain more accurate recognition and classification results.
With the progress of society, expectations for quality of life keep rising, and automated equipment that saves manpower is becoming increasingly important; people-counting devices are one important example. At present many technologies can count people intelligently, such as infrared, thermal imaging and intelligent video analysis; among them, people counting based on intelligent video analysis offers high precision and flexibility and has attracted wide attention. Such a method first captures images with devices such as cameras and then analyzes the captured images to obtain people-counting information.
Compared with people counting based on whole-body detection, people counting based on head detection largely avoids mutual occlusion of detection targets. Existing video-based head detection techniques include probability-statistical models of skin and hair color, Canny edge detection with Hough transformation, ellipse intensity-gradient and color-histogram features, 3D ellipsoid tracking, multi-vision-sensor fusion, and the like. These methods struggle to achieve good head detection in actual scenes for two common reasons. First, detection precision is insufficient because the diversity of head features and the interference of complex backgrounds in actual scenes are insufficiently considered. Second, detection is performed on single pictures only, without exploiting the correlation between adjacent frames of the video.
Disclosure of Invention
The technical problem the invention aims to solve is to provide a method, device, equipment and medium for personnel detection suitable for big data, which realizes head detection in a variety of crowded scenes based on a multi-stage cascaded head classifier, fuses static features, dynamic features and inter-frame correction, and solves the technical problem that good head detection is difficult to achieve in actual scenes owing to complex-background interference and insufficient detection precision.
In a first aspect, the present invention provides a people detection method suitable for big data, comprising the following steps:
S1: taking one frame from every M consecutive frames of the image sequence of the video to obtain a sampled video sequence;
S2: taking N consecutive frames from the sampled video sequence each time and inputting them into a pre-classifier to obtain preliminary human head windows to be detected;
S3: inputting the preliminary human head windows to be detected into a main classifier and further eliminating windows without human heads to obtain secondary human head windows to be detected;
S4: inputting the secondary human head windows to be detected into a cluster analyzer, the detection results of the frames correcting one another to reduce errors and generate a high-precision human head detection result.
In a second aspect, the present invention provides a people detection apparatus suitable for big data, comprising:
the acquisition module, used to take one frame from every M consecutive frames of the image sequence of the video to obtain a sampled video sequence;
the preliminary detection module, used to take N consecutive frames from the sampled video sequence and input them into a pre-classifier to obtain preliminary human head windows to be detected;
the secondary detection module, used to input the preliminary human head windows to be detected into the main classifier and further eliminate windows without human heads to obtain secondary human head windows to be detected;
and the correction module, used to input the secondary human head windows to be detected into the cluster analyzer, where the detection results of the frames correct one another, reducing errors and generating a high-precision human head detection result.
In a third aspect, the present invention provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of the first aspect when executing the program.
In a fourth aspect, the invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the method of the first aspect.
One or more technical solutions provided in the embodiments of the invention have at least the following technical effects or advantages: the method realizes head detection in a variety of crowded scenes based on a multi-stage cascaded head classifier and fuses static features, dynamic features and inter-frame correction. The first-stage pre-classifier screens out the regions where heads may appear at high speed and almost without omission, while excluding regions without heads. The second-stage main classifier has both high precision and high recall. The third-stage cluster analyzer reduces head detection errors by exploiting the inherent links between adjacent frames. The three-stage cascade obtains high-precision detection results while achieving real-time detection, with good overall performance; it can be used for offline people counting on existing massive videos or embedded into an intelligent video surveillance system for real-time analysis.
The foregoing is only an overview of the technical solutions of the invention; embodiments of the invention are described below so that the technical means of the invention can be understood more clearly and the above and other objects, features and advantages of the invention become more apparent.
Drawings
The invention is further described below through examples with reference to the accompanying drawings.
FIG. 1 is a flow chart of the method according to embodiment one of the invention;
FIG. 2 is a flow chart of finding the centers of motion windows using the three-frame difference method;
FIG. 3 is a diagram of the structure of the convolutional neural network constituting the main classifier;
FIG. 4 is a schematic diagram of the feature extraction and classification process of the convolutional neural network constituting the main classifier;
FIG. 5 is a schematic diagram of the fusion of the output results of N consecutive frames after CNN detection;
FIG. 6 is a schematic diagram of the cluster analysis timing;
FIG. 7 is a schematic structural diagram of the apparatus according to embodiment two of the invention;
FIG. 8 is a schematic structural diagram of the electronic device according to embodiment three of the invention;
FIG. 9 is a schematic structural diagram of the medium according to embodiment four of the invention.
Detailed Description
The embodiment of the application provides a personnel detection method, a device, equipment and a medium suitable for big data, and is used for solving the technical problem that a good effect of head detection is difficult to obtain due to interference of a complex background and insufficient detection precision in an actual scene.
The general idea of the technical scheme in the embodiments of the application is as follows: head detection in a variety of crowded scenes is realized with a multi-stage cascaded head classifier that fuses static features, dynamic features and inter-frame correction. The first-stage pre-classifier screens out the regions where heads may appear at high speed and almost without omission, while excluding regions without heads. The second-stage main classifier has both high precision and high recall. The third-stage cluster analyzer reduces head detection errors by exploiting the inherent links between adjacent frames, performing inter-frame correction. The three-stage cascade obtains high-precision detection results while achieving real-time detection, with good overall performance.
Example one
This embodiment provides a personnel detection method suitable for big data, as shown in fig. 1, comprising the following steps:
S1: a sampled video sequence is obtained by taking one frame from every M consecutive frames (M > 1 and less than the video frame rate) of the image sequence of a video (such as a surveillance video). In this embodiment the surveillance video has an image resolution of 640×480 or 320×240, a frame rate of 15 fps, the H.264 compression coding format and the WMV packaging format.
S2: a pre-classifier is used to quickly obtain, with almost no omission, the windows that may contain a human head: N consecutive frames (N > 1) are taken from the sampled video sequence each time and input into the pre-classifier to obtain the preliminary human head windows to be detected. The specific process is as follows:
S21: N consecutive frames are taken from the sampled video sequence each time as the input of the pre-classifier, and motion windows are sought. As shown in fig. 2, taking N = 3 as an example, three consecutive images f(k−1), f(k) and f(k+1) are read from the sampled image sequence each time and preprocessed by graying and Gaussian blur; frame differences are computed for the two pairs of adjacent images and thresholded with a threshold selected by the histogram method, giving D12 and D23; the two thresholded frame differences are then ANDed (D12 ∩ D23) to obtain the motion pixels P(i,j), where i and j indicate that the motion pixel lies in the ith row and jth column of the image.
S22: the 4 pixels near the motion pixel P(i,j), namely P(i−2,j), P(i+2,j), P(i,j−2) and P(i,j+2), are all expanded into motion pixels, giving an expanded set of motion pixels (that is, if the pixel P(i,j) in the ith row and jth column is a motion pixel, then the pixels P(i−2,j), P(i+2,j), P(i,j−2) and P(i,j+2) are all considered motion pixels).
S23: each motion pixel is moved to the nearest vertex of an image cell, and the W×H rectangular area centered on that vertex is taken as a motion window.
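The following sketch illustrates S21 and S22 (the snapping of motion pixels to cell vertices in S23 is omitted). It assumes OpenCV and NumPy, a 5×5 Gaussian kernel, and the fixed binarization threshold of 128 given in claim 2 in place of histogram-based threshold selection:

```python
import cv2
import numpy as np

def motion_mask(f_prev, f_curr, f_next, thresh=128):
    """Three-frame difference (S21) plus motion-pixel expansion (S22)."""
    g = [cv2.GaussianBlur(cv2.cvtColor(f, cv2.COLOR_BGR2GRAY), (5, 5), 0)
         for f in (f_prev, f_curr, f_next)]
    _, d12 = cv2.threshold(cv2.absdiff(g[0], g[1]), thresh, 255, cv2.THRESH_BINARY)
    _, d23 = cv2.threshold(cv2.absdiff(g[1], g[2]), thresh, 255, cv2.THRESH_BINARY)
    mask = cv2.bitwise_and(d12, d23)   # D12 ∩ D23 -> motion pixels P(i,j)
    # S22: also mark the four pixels two rows/columns away from each motion pixel
    kernel = np.zeros((5, 5), np.uint8)
    kernel[2, 2] = kernel[0, 2] = kernel[4, 2] = kernel[2, 0] = kernel[2, 4] = 1
    return cv2.dilate(mask, kernel)
```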
S24: the HOG feature vector of the motion window is computed. First, gamma correction is applied to the gray image; then the magnitude and direction of the gray-level gradient at each pixel of the corrected image are calculated with the following formulas:
Gx(x,y) = I(x+1,y) − I(x−1,y)   (1)
Gy(x,y) = I(x,y+1) − I(x,y−1)   (2)
G(x,y) = sqrt(Gx(x,y)^2 + Gy(x,y)^2)   (3)
θ(x,y) = arctan(Gy(x,y) / Gx(x,y))   (4)
(x,y) ∈ D   (5)
where D is the set of pixels of interest in the image;
I(x,y) is the gray value of the pixel (x,y);
Gx(x,y) and Gy(x,y) are the horizontal and vertical components of the gray gradient at pixel (x,y), respectively;
G(x,y) is the magnitude of the gray gradient at pixel (x,y);
θ(x,y) is the direction of the gray gradient at pixel (x,y).
Finally, the feature vector is constructed: the image is divided into non-overlapping 4×4 image cells, the gradient direction range (0–360°) is divided evenly into 9 bins (0–40°, 40–80°, 80–120°, 120–160°, 160–200°, 200–240°, 240–280°, 280–320°, 320–360°), and a histogram of the gradient directions of the pixels in each cell is computed over these bins, the contribution of each pixel being weighted by its gradient magnitude. Every 4 cells form a rectangular image block, so each block corresponds to 4 nine-dimensional histograms; each local histogram is normalized with the L2 norm. For the W×H rectangular area, the block position is slid with a step of one image cell until the whole area is traversed, yielding the feature vectors of (W/4−1)×(H/4−1) image blocks; concatenating these vectors gives the histogram of oriented gradients of the rectangular area, which serves as its feature description.
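The HOG parameters above map onto OpenCV's HOGDescriptor as sketched below; this is an approximation in one respect, since OpenCV offers only L2-Hys rather than the plain L2 normalization described here, and the file name is a hypothetical stand-in. For a 32×32 window the configuration yields (32/4 − 1)² = 49 blocks of 36 values, i.e. 1764 features:

```python
import cv2

# 32×32 window, 4×4 cells, 2×2-cell (8×8-pixel) blocks slid one cell at a time,
# 9 bins over the signed 0-360° gradient range, gamma (sqrt) correction enabled.
hog = cv2.HOGDescriptor(
    (32, 32),  # winSize
    (8, 8),    # blockSize: 2×2 cells
    (4, 4),    # blockStride: one cell
    (4, 4),    # cellSize
    9,         # nbins
    1,         # derivAperture
    -1.0,      # winSigma
    0,         # histogramNormType (L2-Hys; plain L2 is not offered)
    0.2,       # L2HysThreshold
    True,      # gammaCorrection (OpenCV applies gamma = 0.5, i.e. sqrt)
    64,        # nlevels
    True,      # signedGradient: full 0-360° range as described above
)

window = cv2.imread("head_candidate.png", cv2.IMREAD_GRAYSCALE)  # hypothetical 32×32 crop
features = hog.compute(window)  # 49 blocks × 4 cells × 9 bins = 1764 values
```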
S25: the computed HOG feature vector is input into the trained pre-classifier, which is an HOG-SVM classifier, to screen out the windows that may contain a human head.
Training of the HOG-SVM classifier: the HOG feature vectors of the positive and negative samples are computed and used to train the SVM classifier. The SVM (Support Vector Machine) uses a linear kernel function and the C-SVC model. The data set used for training comprises office-environment surveillance video, the MIT data set and the INRIA data set, with 2000 pictures as the training set and 1000 pictures as the validation set. In each data set, positive and negative samples account for 40% and 60% respectively; the positive samples cover head pictures of all age groups, hair styles, postures and both sexes, and the negative samples cover backgrounds such as walls, electronic equipment, computers, telephones, books, quilts, bags, clothing, tables and boxes. All samples are resized to 32×32 pixels.
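A minimal training sketch for the pre-classifier follows, assuming scikit-learn and hypothetical .npy files holding the HOG vectors and labels of the samples described above; C = 1 matches the penalty coefficient of claim 2:

```python
import numpy as np
from sklearn.svm import SVC

X_train = np.load("hog_train.npy")     # hypothetical: one 1764-dim HOG row per sample
y_train = np.load("labels_train.npy")  # 1 = head, 0 = background

svm = SVC(kernel="linear", C=1.0)      # C-SVC with a linear kernel, as in the text
svm.fit(X_train, y_train)

def is_head_candidate(hog_vector):
    """Pre-classifier decision for one motion window."""
    return svm.predict(hog_vector.reshape(1, -1))[0] == 1
```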
S3: the preliminary human head windows to be detected are input into the main classifier, which further excludes windows without human heads and retains the windows with human heads, giving the secondary human head windows to be detected:
The overall network structure, shown in fig. 3 and fig. 4, mainly comprises convolution layers, activation layers, pooling layers, a rasterization layer, a fully connected layer and a regression layer. The network input is a 32×32 gray (single-channel) image, which passes first through convolution C1, activation A1 and pooling P1, then through convolution C2, activation A2 and pooling P2, and finally through rasterization F, full connection FC and regression R. The network is described as follows:
convolutional layer C1 has 6 convolution kernels, each of size 5×5;
the activation function of activation layer A1 is Softplus;
pooling layer P1 uses average pooling with a 2×2 kernel;
the convolution kernel size of convolution C2 is 7×7, and each neuron of C2 is connected to 5×5 neighborhoods in 3 to 6 of the feature maps of P1, with the connection pattern shown in Table 1;
the activation function of activation layer A2 is the Softplus function;
the pooling kernel size of pooling layer P2 is 2×2;
the rasterization layer F takes out the elements after pooling P2 in order and arranges them into a 256-dimensional vector, which then passes through a fully connected layer FC containing 2 neurons;
the regression layer R performs logistic regression through the Softmax function to obtain the probability (or score) that the classification result falls into each category.
TABLE 1 connection relationship between convolutional layers and feature layers
(The table is reproduced as an image in the original publication; its contents are not recoverable here.)
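A PyTorch sketch of this network is given below, under two stated assumptions: the partial C2 connectivity of Table 1 is replaced by full connectivity (the table itself is not recoverable), and C2 is given 16 feature maps, inferred from the 256-dimensional rasterized vector (16 × 4 × 4 = 256); P2 is assumed to be average pooling like P1:

```python
import torch
import torch.nn as nn

class HeadClassifier(nn.Module):
    """Main classifier: 32×32 single-channel input -> head/non-head probabilities."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # C1: 6 kernels of 5×5 -> 6×28×28
            nn.Softplus(),                    # A1
            nn.AvgPool2d(2),                  # P1: 2×2 average pooling -> 6×14×14
            nn.Conv2d(6, 16, kernel_size=7),  # C2: 7×7 kernels -> 16×8×8 (full connectivity assumed)
            nn.Softplus(),                    # A2
            nn.AvgPool2d(2),                  # P2: 2×2 -> 16×4×4
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),        # F: rasterization into a 256-dimensional vector
            nn.Linear(256, 2),   # FC: 2 neurons
        )

    def forward(self, x):
        logits = self.classifier(self.features(x))
        return torch.softmax(logits, dim=1)   # R: probability per class

probs = HeadClassifier()(torch.rand(1, 1, 32, 32))  # two class probabilities per window
```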
The data set used for training the convolutional neural network comprises office-environment surveillance video, the MIT data set and the INRIA data set, with 20000 pictures as the training set, 3000 pictures as the validation set and 10000 pictures as the test set. All samples are resized to 32×32 pixels.
The cross-entropy loss between the classification probability output by the regression layer and the sample label (0 or 1) is taken as the error, and the error is back-propagated by computing gradients with the chain rule:
∂L/∂w_i = (∂L/∂y) · (∂y/∂f_m) · (∂f_m/∂f_(m−1)) · … · (∂f_i/∂w_i)   (6)
where w_i is the convolution kernel of the ith layer, L is the training error, and y, f_m, f_(m−1), …, f_i are the outputs of the layers of the CNN.
After the gradients of the parameters have been computed in back propagation, the parameters are updated according to formulas (7) and (8):
Δw* = m·Δw + α·(∂L/∂w)   (7)
w* = w − Δw*   (8)
where w* is the updated parameter, Δw is the previous parameter update, α is the set learning rate and m is the momentum used to accelerate training.
During training, the convolutional neural network is trained with the stochastic gradient descent method, with momentum m = 0.9, weight decay coefficient 0.0005, initial learning rate α = 0.1 and learning-rate decay coefficient 0.01. Overfitting is suppressed with the hold-out method. The CNN is trained by repeatedly drawing 30 random pictures from the training set and updating the network parameters, with cross validation on the validation set after every 10 iterations. Training stops when the cross-validation error reaches its minimum.
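A corresponding training step, reusing HeadClassifier from the sketch above, might look as follows; it trains on the logits (softmax is applied only at inference), which is the usual way to realize the cross-entropy error in PyTorch, and the learning-rate decay schedule is omitted for brevity:

```python
import torch
import torch.nn as nn

model = HeadClassifier()  # from the sketch above
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=0.0005)
loss_fn = nn.CrossEntropyLoss()  # cross entropy between prediction and 0/1 label

def train_step(images, labels):
    """One update on a random mini-batch of 30 pictures (images: 30×1×32×32)."""
    opt.zero_grad()
    logits = model.classifier(model.features(images))
    loss = loss_fn(logits, labels)
    loss.backward()  # back-propagation, formula (6)
    opt.step()       # momentum update, formulas (7) and (8)
    return loss.item()
```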
S4: the secondary human head windows to be detected are input into the cluster analyzer, so that the detection results of the frames correct one another, reducing errors and producing a high-precision human head detection result. The specific process is as follows:
S41: selecting the initial cluster centers: the secondary human head windows to be detected are input into the cluster analyzer; among the N consecutive frames, the frame containing the most human head windows is selected, its p human head windows are taken as the initial p human head windows, and their centers are taken as the initial cluster centers, the number of clusters being p;
S42: K-means iteration: p new cluster centers and the numbers n1, n2, …, np of human head windows contained in the classes are obtained through the iterative operation of the K-means algorithm;
S43: judging whether a human head is contained: for the ith initial human head window, let its cluster contain ni head windows after clustering; if ni > N/2, the window is considered to contain a human head, and if ni < N/2, it is considered to contain no human head;
S44: the above process is repeated until the analysis of the whole video is completed.
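A sketch of S41–S43 using scikit-learn's KMeans, assuming each frame's detections have been reduced to window-center coordinates (the function name and array layout are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def fuse_detections(frame_windows):
    """frame_windows: list of N arrays, each of shape (k_i, 2), holding the
    (x, y) centers of the head windows detected in one frame."""
    N = len(frame_windows)
    init = max(frame_windows, key=len)     # S41: frame with the most windows
    p = len(init)
    if p == 0:
        return []
    points = np.vstack([w for w in frame_windows if len(w)])
    km = KMeans(n_clusters=p, init=init, n_init=1).fit(points)   # S42
    counts = np.bincount(km.labels_, minlength=p)
    # S43: keep a cluster only if more than N/2 of the frames support it
    return [km.cluster_centers_[i] for i in range(p) if counts[i] > N / 2]
```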
The parameters of the cluster analyzer are set as follows: the video frame rate is 15 fps, the video sampling interval is 5 s (75 frames), the number N of frames per clustering input is 3, the clustering sliding step is 1, and the sample size is 32×32 pixels.
The timing of cluster analysis over N consecutive images is shown in fig. 6. Let T be the interval between two adjacent frames of the original video; each video segment of duration M×N×T then yields one output of head detection results. After the cluster analyzer, the video segments corresponding to two adjacent outputs are M×P×T apart, where P is the clustering sliding step. A video of total duration L therefore yields (L − M×N×T)/(M×P×T) + 1 detection results in total, each comprising the positions and the number of the human heads.
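As a worked example under the parameter settings above (T = 1/15 s, M = 75, N = 3, P = 1): each output summarizes a video segment of M×N×T = 15 s, adjacent outputs are M×P×T = 5 s apart, and one hour of video (L = 3600 s) yields (3600 − 15)/5 + 1 = 718 detection results.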
S5: the next N frames are taken starting P frames after the original N frames of the sampled video sequence, and the process returns to step S2 to analyze the new N-frame sampled image sequence, until head detection results have been obtained for all the video data.
An 80-hour video was selected to test the precision of the head detection and counting software. By manually watching the 80 hours of surveillance video, the number of heads and the corresponding time were recorded every 15 seconds, producing approximately 14000 records as a test set. The experimental results of the cascade classifier are shown in Table 2: the first-stage pre-classifier screens out the regions where heads may appear at high speed and almost without omission, while excluding regions without heads; the second-stage main classifier has both high precision and high recall; the third-stage cluster analyzer reduces head detection errors by exploiting the inherent links between adjacent frames. The results of the performance comparison of the method of the invention with other algorithms are shown in Table 3. Because the method considers the relation between video image sequences, it achieves real-time detection while obtaining high-precision detection results and has good overall performance.
TABLE 2 Performance of cascaded classifiers
(The table is reproduced as an image in the original publication; its contents are not recoverable here.)
Table 3 performance comparison with other algorithms
(The table is reproduced as an image in the original publication; its contents are not recoverable here.)
Based on the same inventive concept, the application also provides a device corresponding to the method in the first embodiment, which is detailed in the second embodiment.
Example two
This embodiment provides a personnel detection apparatus suitable for big data, as shown in fig. 7, comprising:
the acquisition module, used to take one frame from every M consecutive frames of the image sequence of the video to obtain a sampled video sequence;
the preliminary detection module, used to take N consecutive frames from the sampled video sequence and input them into a pre-classifier to obtain preliminary human head windows to be detected;
the secondary detection module, used to input the preliminary human head windows to be detected into the main classifier and further eliminate windows without human heads to obtain secondary human head windows to be detected;
and the correction module, used to input the secondary human head windows to be detected into the cluster analyzer, where the detection results of the frames correct one another, reducing errors and generating a high-precision human head detection result.
Wherein the preliminary detection module specifically performs the following process:
S21, reading N consecutive images from the sampled image sequence each time, preprocessing them by graying and Gaussian blur, computing frame differences for the two pairs of adjacent images, selecting a threshold by the histogram method for thresholding, and ANDing the two thresholded frame differences to obtain the motion pixels P(i,j), where i and j indicate that the motion pixel lies in the ith row and jth column of the image;
S22, expanding the 4 pixels near the motion pixel P(i,j), namely P(i−2,j), P(i+2,j), P(i,j−2) and P(i,j+2), into motion pixels;
S23, moving each motion pixel to the nearest vertex of an image cell, and taking the W×H rectangular area centered on that vertex as a motion window;
S24, computing the HOG feature vector of the motion window: dividing the motion window into non-overlapping 4×4 image cells, dividing the gradient direction range of 0° to 360° evenly into 9 bins, computing a histogram of the gradient directions of the pixels in each cell over these bins, the contribution of each pixel being weighted by its gradient magnitude; every 4 cells forming a rectangular image block, each block corresponding to 4 nine-dimensional histograms; normalizing each local histogram with the L2 norm; for the W×H rectangular area, sliding the block position with a step of one image cell until the whole area is traversed to obtain the feature vectors of (W/4−1)×(H/4−1) image blocks, and concatenating these vectors into the histogram of oriented gradients of the rectangular area as its feature description;
S25, inputting the computed HOG feature vector into the trained pre-classifier, which is an HOG-SVM classifier, to screen out the windows that may contain a human head;
wherein the correction module specifically performs the following process:
S41, inputting the secondary human head windows to be detected into the cluster analyzer, selecting, among the N consecutive frames, the frame containing the most human head windows, taking its p human head windows as the initial p human head windows and their centers as the initial cluster centers, the number of clusters being p;
S42, obtaining, through the iterative operation of the K-means algorithm, p new cluster centers and the numbers n1, n2, …, np of human head windows contained in the classes;
S43, for the ith initial human head window, its cluster containing ni head windows after clustering: if ni > N/2, the window is considered to contain a human head; if ni < N/2, the window is considered to contain no human head;
S44, repeating the processes S41 to S44 until the analysis of the whole video is completed.
Since the apparatus described in embodiment two is the apparatus used to implement the method of embodiment one of the invention, a person skilled in the art can understand its specific structure and variations based on the method described in embodiment one, so the details are not repeated here. All apparatuses used in the method of embodiment one of the invention fall within the protection scope of the invention.
Based on the same inventive concept, the application provides an electronic device embodiment corresponding to the first embodiment, which is detailed in the third embodiment.
EXAMPLE III
This embodiment provides an electronic device, as shown in fig. 8, comprising a memory, a processor and a computer program stored on the memory and executable on the processor; when the processor executes the program, any implementation of embodiment one can be realized.
Since the electronic device described in this embodiment is the device used to implement the method of embodiment one of the application, a person skilled in the art can understand the specific implementation of the electronic device and its variations based on the method described in embodiment one, so how the electronic device implements that method is not detailed here. Any equipment used by those skilled in the art to implement the method of embodiment one of the application falls within the protection scope of the application.
Based on the same inventive concept, the application provides a storage medium corresponding to the fourth embodiment, which is described in detail in the fourth embodiment.
Example four
This embodiment provides a computer-readable storage medium, as shown in fig. 9, on which a computer program is stored; when the computer program is executed by a processor, any implementation of embodiment one can be realized.
The technical scheme provided in the embodiments of the application has at least the following technical effects or advantages: the method, device, equipment and medium provided by the embodiments realize head detection in a variety of crowded scenes based on a multi-stage cascaded head classifier and fuse static features, dynamic features and inter-frame correction. The first-stage pre-classifier screens out the regions where heads may appear at high speed and almost without omission, while excluding regions without heads. The second-stage main classifier has both high precision and high recall. The third-stage cluster analyzer reduces head detection errors by exploiting the inherent links between adjacent frames, performing inter-frame correction. The three-stage cascade obtains high-precision detection results while achieving real-time detection, with good overall performance.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While specific embodiments of the invention have been described, it will be understood by those skilled in the art that the specific embodiments described are illustrative only and are not limiting upon the scope of the invention, as equivalent modifications and variations as will be made by those skilled in the art in light of the spirit of the invention are intended to be included within the scope of the appended claims.

Claims (7)

1. A personnel detection method suitable for big data, characterized by comprising the following steps:
S1: taking one frame from every M consecutive frames of the image sequence of the video to obtain a sampled video sequence;
S2: taking N consecutive frames from the sampled video sequence each time and inputting them into a pre-classifier to obtain preliminary human head windows to be detected;
S3: inputting the preliminary human head windows to be detected into a main classifier and further eliminating windows without human heads to obtain secondary human head windows to be detected;
S4: inputting the secondary human head windows to be detected into a cluster analyzer, the detection results of the frames correcting one another to obtain a high-precision human head detection result;
the step S2 is specifically:
S21, reading N consecutive images from the sampled image sequence each time, preprocessing them by graying and Gaussian blur, computing frame differences for the two pairs of adjacent images, selecting a threshold by the histogram method for thresholding, and ANDing the two thresholded frame differences to obtain the motion pixels P(i,j), where i and j indicate that the motion pixel lies in the ith row and jth column of the image;
S22, expanding the 4 pixels near the motion pixel P(i,j), namely P(i−2,j), P(i+2,j), P(i,j−2) and P(i,j+2), into motion pixels;
S23, moving each motion pixel to the nearest vertex of an image cell, and taking the W×H rectangular area centered on that vertex as a motion window;
S24, computing the HOG feature vector of the motion window: dividing the motion window into non-overlapping 4×4 image cells, dividing the gradient direction range of 0° to 360° evenly into 9 bins, computing a histogram of the gradient directions of the pixels in each cell over these bins, the contribution of each pixel being weighted by its gradient magnitude; every 4 cells forming a rectangular image block, each block corresponding to 4 nine-dimensional histograms; normalizing each local histogram with the L2 norm; for the W×H rectangular area, sliding the block position with a step of one image cell until the whole area is traversed to obtain the feature vectors of (W/4−1)×(H/4−1) image blocks, and concatenating these vectors into the histogram of oriented gradients of the rectangular area as its feature description;
S25, inputting the computed HOG feature vector into the trained pre-classifier, which is an HOG-SVM classifier, to screen out the windows that may contain a human head;
the main classifier is realized by a convolutional neural network, and the convolutional neural network comprises 2 convolution-activation-pooling layers, 1 rasterization layer, 1 full connection layer and 1 regression layer; the step S3 is specifically:
the network input is a gray image of size 32×32, which is processed first by convolution C1, activation A1 and pooling P1, then by convolution C2, activation A2 and pooling P2, and finally by rasterization F, full connection FC and regression R, wherein:
convolutional layer C1 has 6 convolution kernels, each of size 5×5;
the activation function of activation layer A1 is Softplus;
pooling layer P1 uses average pooling with a 2×2 kernel;
the convolution kernel size of convolution C2 is 7×7, and each neuron of convolution C2 is connected to 5×5 neighborhoods in 3 to 6 of the feature maps of P1;
the activation function of activation layer A2 is the Softplus function;
the pooling kernel size of pooling layer P2 is 2×2;
the rasterization layer F takes out the elements formed after pooling P2 in order and arranges them into a 256-dimensional vector, which then passes through a fully connected layer FC containing 2 neurons;
the regression layer R performs logistic regression through the Softmax function to obtain the probability that the classification result falls into each class;
when the main classifier is trained, the input samples are 32×32 gray images, the convolutional neural network is trained using the stochastic gradient descent method, overfitting is restrained using the hold-out method, and the error is calculated with the cross-entropy loss function.
2. The personnel detection method suitable for big data according to claim 1, characterized in that the HOG-SVM classifier training parameters are set as follows: the sample size is 32×32 pixels, the Gamma correction parameter is 0.5, the binarization threshold in the three-frame difference method is 128, the multi-scale search-window scale parameter is 1.1, the penalty coefficient of the generalized classification surface is 1, the bias term is 4.91, and a linear kernel function and the C-SVC model are adopted.
3. The personnel detection method suitable for big data according to claim 1, characterized in that the specific process of step S4 is as follows:
S41, inputting the secondary human head windows to be detected into the cluster analyzer, selecting, among the N consecutive frames, the frame containing the most human head windows, taking its p human head windows as the initial p human head windows and their centers as the initial cluster centers, the number of clusters being p;
S42, obtaining, through the iterative operation of the K-means algorithm, p new cluster centers and the numbers n1, n2, …, np of human head windows contained in the classes;
S43, for the ith initial human head window, its cluster containing ni head windows after clustering: if ni > N/2, the window is considered to contain a human head; if ni < N/2, the window is considered to contain no human head;
S44, repeating the above process until the analysis of the whole video is completed.
4. The personnel detection method suitable for big data according to claim 1 or 3, characterized in that the parameters of the cluster analyzer are set as follows: the video frame rate is 15 fps, the video sampling interval is 5 s (75 frames), the number of frames per clustering input is 3, the clustering sliding step is 1, and the sample size is 32×32 pixels.
5. A personnel detection device suitable for big data, characterized by comprising:
the acquisition module, used to take one frame from every M consecutive frames of the image sequence of the video to obtain a sampled video sequence;
the preliminary detection module, used to take N consecutive frames from the sampled video sequence and input them into a pre-classifier to obtain preliminary human head windows to be detected;
the secondary detection module, used to input the preliminary human head windows to be detected into the main classifier and further eliminate windows without human heads to obtain secondary human head windows to be detected;
and the correction module, used to input the secondary human head windows to be detected into the cluster analyzer, where the detection results of the frames correct one another, reducing errors and generating a high-precision human head detection result;
the preliminary detection module specifically executes the following process:
S21, reading N consecutive images from the sampled image sequence each time, preprocessing them by graying and Gaussian blur, computing frame differences for the two pairs of adjacent images, selecting a threshold by the histogram method for thresholding, and ANDing the two thresholded frame differences to obtain the motion pixels P(i,j), where i and j indicate that the motion pixel lies in the ith row and jth column of the image;
S22, expanding the 4 pixels near the motion pixel P(i,j), namely P(i−2,j), P(i+2,j), P(i,j−2) and P(i,j+2), into motion pixels;
S23, moving each motion pixel to the nearest vertex of an image cell, and taking the W×H rectangular area centered on that vertex as a motion window;
S24, computing the HOG feature vector of the motion window: dividing the motion window into non-overlapping 4×4 image cells, dividing the gradient direction range of 0° to 360° evenly into 9 bins, computing a histogram of the gradient directions of the pixels in each cell over these bins, the contribution of each pixel being weighted by its gradient magnitude; every 4 cells forming a rectangular image block, each block corresponding to 4 nine-dimensional histograms; normalizing each local histogram with the L2 norm; for the W×H rectangular area, sliding the block position with a step of one image cell until the whole area is traversed to obtain the feature vectors of (W/4−1)×(H/4−1) image blocks, and concatenating these vectors into the histogram of oriented gradients of the rectangular area as its feature description;
S25, inputting the computed HOG feature vector into the trained pre-classifier, which is an HOG-SVM classifier, to screen out the windows that may contain a human head;
the correction module specifically executes the following process:
S41, inputting the secondary human head windows to be detected into the cluster analyzer, selecting, among the N consecutive frames, the frame containing the most human head windows, taking its p human head windows as the initial p human head windows and their centers as the initial cluster centers, the number of clusters being p;
S42, obtaining, through the iterative operation of the K-means algorithm, p new cluster centers and the numbers n1, n2, …, np of human head windows contained in the classes;
S43, for the ith initial human head window, its cluster containing ni head windows after clustering: if ni > N/2, the window is considered to contain a human head; if ni < N/2, the window is considered to contain no human head;
S44, repeating the processes S41 to S44 until the analysis of the whole video is completed.
6. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 4 when executing the program.
7. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 4.
CN201911201697.8A 2019-11-29 2019-11-29 Personnel detection method, device, equipment and medium suitable for big data Active CN111144220B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911201697.8A CN111144220B (en) 2019-11-29 2019-11-29 Personnel detection method, device, equipment and medium suitable for big data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911201697.8A CN111144220B (en) 2019-11-29 2019-11-29 Personnel detection method, device, equipment and medium suitable for big data

Publications (2)

Publication Number Publication Date
CN111144220A CN111144220A (en) 2020-05-12
CN111144220B true CN111144220B (en) 2023-03-24

Family

ID=70517352

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911201697.8A Active CN111144220B (en) 2019-11-29 2019-11-29 Personnel detection method, device, equipment and medium suitable for big data

Country Status (1)

Country Link
CN (1) CN111144220B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111738108B (en) * 2020-06-08 2024-01-16 中国电信集团工会上海市委员会 Method and system for detecting head of video stream
CN112711233A (en) * 2020-12-25 2021-04-27 珠海格力电器股份有限公司 Intelligent household control method and device, processor, storage medium and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105303193A (en) * 2015-09-21 2016-02-03 重庆邮电大学 People counting system for processing single-frame image
CN105718868A (en) * 2016-01-18 2016-06-29 中国科学院计算技术研究所 Face detection system and method for multi-pose faces
CN106485273A (en) * 2016-10-09 2017-03-08 湖南穗富眼电子科技有限公司 A kind of method for detecting human face based on HOG feature and DNN grader
CN106485230A (en) * 2016-10-18 2017-03-08 中国科学院重庆绿色智能技术研究院 Based on the training of the Face datection model of neutral net, method for detecting human face and system
WO2018086513A1 (en) * 2016-11-08 2018-05-17 杭州海康威视数字技术股份有限公司 Target detection method and device


Also Published As

Publication number Publication date
CN111144220A (en) 2020-05-12


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant