US20190206085A1 - Human head detection method, electronic device and storage medium - Google Patents

Human head detection method, electronic device and storage medium

Info

Publication number
US20190206085A1
Authority
US
United States
Prior art keywords
human head
image
sub
electronic device
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US16/351,093
Inventor
Deqiang JIANG
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Assigned to TENCENT TECHNOLOGY (SHENZHEN) COMPANY LIMITED reassignment TENCENT TECHNOLOGY (SHENZHEN) COMPANY LIMITED ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: JIANG, Deqiang
Publication of US20190206085A1 publication Critical patent/US20190206085A1/en
Legal status: Abandoned

Classifications

    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/22: Matching criteria, e.g. proximity measures
    • G06N 3/045: Combinations of networks
    • G06N 3/082: Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06T 7/10: Image analysis; Segmentation; Edge detection
    • G06T 7/20: Image analysis; Analysis of motion
    • G06T 7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/73: Determining position or orientation of objects or cameras using feature-based methods
    • G06T 7/74: Determining position or orientation of objects or cameras using feature-based methods involving reference images or patches
    • G06V 10/454: Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V 10/751: Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
    • G06V 10/764: Image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V 20/53: Surveillance or monitoring of activities; Recognition of crowd images, e.g. recognition of crowd congestion
    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G07C 9/00: Individual registration on entry or exit
    • G06T 2207/10016: Image acquisition modality; Video; Image sequence
    • G06T 2207/20021: Dividing image into blocks, subimages or windows
    • G06T 2207/20081: Training; Learning
    • G06T 2207/20084: Artificial neural networks [ANN]
    • G06T 2207/30196: Human being; Person
    • G06T 2207/30242: Counting objects in image

Definitions

  • This application relates to the technical field of image processing, and in particular, to a method, an electronic device and a storage medium for human head detection.
  • Human head detection refers to the detection of the head of a human body in an image, and a result of the human head detection has various applications, such as applications in the field of security.
  • Conventionally, human head detection is implemented mainly based on the shape and color of a human head.
  • A specific process of such human head detection includes: first, binarizing image pixels and performing edge detection to acquire a substantially circular edge; then using circle detection to acquire the position and size of the circular edge; and finally performing gray-scale and size determination on the corresponding circular area in the original image to obtain a human head detection result.
  • Such human head detection relies on an assumption that the shape of the human head is circular.
  • However, the shape of a human head is not strictly circular, and the head shapes of different persons also differ.
  • As a result, some human heads are missed in detection, and the accuracy of the human head detection result is relatively low.
  • a human head detection method includes:
  • segmenting, by an electronic device, an image to be detected into one or more sub-images;
  • inputting, by the electronic device, each sub-image to a convolutional neural network trained according to a training image having a marked human head position, and outputting, by a preprocessing layer including a convolutional layer and a pooling layer in the convolutional neural network, a first feature corresponding to each sub-image;
  • mapping, by the electronic device through a convolutional layer after the preprocessing layer in the convolutional neural network, the first feature corresponding to each sub-image to a second feature corresponding to each sub-image;
  • mapping, by the electronic device through a regression layer in the convolutional neural network, the second feature corresponding to each sub-image to a human head position corresponding to each sub-image and a corresponding confidence level of the human head position; and
  • filtering, by the electronic device according to the corresponding confidence level, the human head position corresponding to each sub-image, to acquire a human head position detected in the image to be detected.
  • An electronic device includes a memory and a processor, the memory storing a computer readable instruction, the computer readable instruction, when executed by the processor, causing the processor to perform the following steps:
  • segmenting an image to be detected into one or more sub-images;
  • inputting each sub-image to a convolutional neural network trained according to a training image having a marked human head position, and outputting, by a preprocessing layer including a convolutional layer and a pooling layer in the convolutional neural network, a first feature corresponding to each sub-image;
  • mapping, through a convolutional layer after the preprocessing layer in the convolutional neural network, the first feature corresponding to each sub-image to a second feature corresponding to each sub-image;
  • mapping, through a regression layer of the convolutional neural network, the second feature corresponding to each sub-image to a human head position corresponding to each sub-image and a corresponding confidence level of the human head position; and
  • filtering, according to the corresponding confidence level, the human head position corresponding to each sub-image, to acquire a human head position detected in the image to be detected.
  • One or more non-volatile storage media storing computer readable instructions are provided, the computer readable instructions, when executed by one or more processors, causing the one or more processors to perform the following steps:
  • segmenting an image to be detected into one or more sub-images;
  • inputting each sub-image to a convolutional neural network trained according to a training image having a marked human head position, and outputting, by a preprocessing layer including a convolutional layer and a pooling layer in the convolutional neural network, a first feature corresponding to each sub-image;
  • mapping, through a convolutional layer after the preprocessing layer in the convolutional neural network, the first feature corresponding to each sub-image to a second feature corresponding to each sub-image;
  • mapping, through a regression layer in the convolutional neural network, the second feature corresponding to each sub-image to a human head position corresponding to each sub-image and a corresponding confidence level of the human head position; and
  • filtering, according to the corresponding confidence level, the human head position corresponding to each sub-image, to acquire a human head position detected in the image to be detected.
  • FIG. 1 shows an application environment diagram of a human head detection method according to an embodiment.
  • FIG. 2 shows a schematic diagram of an internal structure of an electronic device according to an embodiment.
  • FIG. 3 shows a schematic flowchart of a human head detection method according to an embodiment.
  • FIG. 4 shows a schematic structural diagram of a convolutional neural network according to an embodiment.
  • FIG. 5 shows a schematic flowchart for converting a convolutional neural network for image classification to a convolutional neural network for human head detection.
  • FIG. 6 is a schematic flowchart for filtering human head positions according to confidence levels.
  • FIG. 7 is a schematic flowchart for implementing step 606 of FIG. 6 .
  • FIG. 8 is a schematic flowchart of performing human head tracking and people counting in a video frame by frame.
  • FIG. 9 is a schematic flowchart for detecting a human head position in a current video frame near a human head position tracked in a previous video frame, and continuing to track the human head, when the tracking of the human head position is interrupted at the previous video frame.
  • FIG. 10 illustrates an application scenario for human head detection and tracking.
  • FIG. 11 is a schematic diagram of performing people counting by using two parallel lines according to an embodiment.
  • FIG. 12 is a structural block diagram of a human head detection apparatus according to an embodiment.
  • FIG. 13 is a structural block diagram of a human head detection apparatus according to another embodiment.
  • FIG. 14 is a structural block diagram of a human head detection result determining module according to an embodiment.
  • FIG. 15 is a structural block diagram of a human head detection apparatus according to still another embodiment.
  • FIG. 16 is a structural block diagram of a human head detection apparatus according to yet another embodiment.
  • FIG. 1 is an application environment diagram of a human head detection method according to an embodiment.
  • the human head detection method is applied to a human head detection system, which includes an electronic device 110 and a top view camera 120 connected to the electronic device 110 .
  • the top view camera 120 is configured to capture an image to be detected and send the image to be detected to the electronic device 110 .
  • the top view camera may be mounted on the top of a building or at a wall above the height of a person (or a predetermined height) or at a corner of the top of the building, so that the top view camera can capture images of a top view angle.
  • the top view may be an orthographic top view or an oblique-angle top view (alternatively referred to as a perspective top view).
  • the electronic device 110 may be configured to segment an image to be detected into one or more sub-images; input each sub-image to a convolutional neural network trained according to training images having marked human head positions (or labeled with human heads), and output, by a preprocessing layer including at least one convolutional layer and at least one pooling layer in the convolutional neural network, a first feature corresponding to each sub-image; map, through at least one further convolutional layer after the preprocessing layer in the convolutional neural network, the first feature corresponding to each sub-image to a second feature corresponding to each sub-image; map, through at least one regression layer of the convolutional neural network, the second feature corresponding to each sub-image to a human head position corresponding to each sub-image and a corresponding confidence level of the human head position; and filter, according to the corresponding confidence level, the human head position corresponding to each sub-image, to acquire human head positions detected in the image to be detected.
  • FIG. 2 is a schematic diagram of an internal structure of an electronic device according to an embodiment.
  • the electronic device includes a processor, a memory and a network interface which are connected by a system bus.
  • the memory includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium of the electronic device may store an operating system and computer readable instructions. When executed, the computer readable instructions may cause the processor to perform a human head detection method.
  • the processor of the electronic device may include a central processing unit and a graphics processing unit. The processor is configured to provide computing and control capabilities to support operation of the electronic device.
  • the internal memory may store the computer readable instructions. When executed by the processor, the computer readable instructions may cause the processor to perform a human head detection method.
  • the network interface of the electronic device is configured to be connected to the top view camera.
  • the electronic device may be implemented by an integrated electronic device or a cluster including multiple electronic devices.
  • the electronic device may be a personal computer, a server or a dedicated human head detection device.
  • FIG. 2 is only a block diagram of a part of the structure related to the solution of this application, and does not constitute a limitation on the electronic device to which the solution of this application is applied.
  • the specific electronic device may include more or fewer components than those shown in the figure, or may combine some components, or have a different component arrangement.
  • FIG. 3 is a schematic flowchart of a human head detection method according to an embodiment. This embodiment is mainly illustrated by applying the method to the electronic device 110 in above FIG. 1 and FIG. 2 .
  • the human head detection method specifically includes the following steps:
  • S 302 Segment an image to be detected into one or more sub-images.
  • the image to be detected is an image on which human head detection needs to be performed.
  • the image to be detected may be a picture or a video frame in a video.
  • the sub-images are images which are segmented from the image to be detected and have a size smaller than the image to be detected. All segmented sub-images may have the same size or different sizes.
  • the electronic device may traverse a window of a fixed size in the image to be detected according to a transverse step size and a longitudinal step size, thereby segmenting the sub-images having the same size as the window size from the image to be detected during the traversal process.
  • the segmented sub-images may be combined to form the image to be detected.
  • step S 302 includes: segmenting the image to be detected into one or more sub-images of a fixed size, adjacent sub-images in the segmented sub-images having an overlapping part.
  • adjacent sub-images are sub-images whose positions in the image to be detected are adjacent, and adjacent sub-images may partially overlap.
  • the electronic device may traverse the window of a fixed size in the image to be detected according to the transverse step size smaller than a window width and the longitudinal step size smaller than a window height, to acquire one or more sub-images of the same size, and adjacent sub-images have an overlapping part.
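  • As an illustration of the sliding-window segmentation described above, the following is a minimal sketch in Python (NumPy); the window size, step sizes and function name are illustrative assumptions rather than values specified by this application:

```python
import numpy as np

def segment_into_sub_images(image, win_h, win_w, step_y, step_x):
    """Traverse a fixed-size window over the image and return the sub-images
    together with their top-left coordinates.

    Choosing step_x < win_w and step_y < win_h makes adjacent sub-images
    overlap, as described above."""
    sub_images = []
    h, w = image.shape[:2]
    for y in range(0, max(h - win_h, 0) + 1, step_y):
        for x in range(0, max(w - win_w, 0) + 1, step_x):
            sub_images.append(((y, x), image[y:y + win_h, x:x + win_w]))
    return sub_images

# Example: 64x64 windows with 50% overlap on a top-view frame.
frame = np.zeros((480, 640, 3), dtype=np.uint8)
subs = segment_into_sub_images(frame, 64, 64, 32, 32)
```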
  • S 304 Input each sub-image to a convolutional neural network trained according to a set of training images having marked human head positions, and output, by a preprocessing layer including at least one convolutional layer and at least one pooling layer in the convolutional neural network, a first feature corresponding to each sub-image.
  • a convolutional neural network (CNN) is a type of artificial neural network.
  • the convolutional neural network includes a preprocessing layer having at least one convolutional layer and at least one pooling layer.
  • the convolutional neural network used in this embodiment may be directly constructed, and may alternatively be acquired by reconstructing an existing convolutional neural network.
  • a computational task in the convolutional neural network may be implemented by a central processing unit or a graphics processing unit. The time consumed by the central processing unit for human head detection is approximately on the order of seconds, and the time consumed by the graphics processing unit for human head detection may be reduced to the order of hundreds of milliseconds, thereby realizing real-time human head detection.
  • each feature map includes a plurality of neurons, and all neurons of the same feature map share one convolution kernel.
  • the convolution kernel provides a weight of the corresponding neuron, and the convolution kernel represents a feature.
  • the convolution kernel is generally initialized in a form of a random decimal matrix, and a proper convolution kernel will be learned during training of the network to represent a feature.
  • the convolutional layer can reduce the number of connections between layers in the neural network and, in addition, reduce the risk of overfitting.
  • Pooling may take two exemplary forms of implementation: mean pooling and max pooling. Pooling may be considered as a special convolutional process. Convolution and pooling greatly simplify complexity of the neural network and reduce parameters of the neural network.
  • the training images having human heads therein may be pre-marked (or labeled) with human head positions
  • human head positions in the training images may be manually marked or labeled, or may be marked or labeled using other automatic means.
  • the training images having the marked human head positions and the image to be detected may be images captured in a similar scene, setting or background, thereby further improving the accuracy of human head detection.
  • the training images having marked human head positions may have the same size as, or a different size from, the image to be detected.
  • a confidence level may be assigned to the human head position marked in the training image.
  • the training image is segmented into one or more sub-images according to the same segmentation manner as that of the image to be detected.
  • the segmented sub-images are separately input to the convolutional neural network, and the convolutional neural network outputs human head positions and confidence levels.
  • a difference between the output head positions and the marked head position is calculated, and a difference between the corresponding confidence levels is calculated.
  • parameters of the convolutional neural network are adjusted.
  • the training is continued until a termination condition is reached.
  • the termination condition may be that each difference is less than a preset difference threshold, or the number of iterations reaches a preset number of times.
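  • The exact form of the "differences" used to adjust the network parameters is not specified above; the following PyTorch-style sketch shows one hedged interpretation, combining a position term with a confidence term. The framework, loss functions and all names are assumptions for illustration only:

```python
import torch
import torch.nn.functional as F

def detection_loss(pred_boxes, pred_conf, gt_boxes, gt_conf):
    """One possible training objective: a position difference plus a
    confidence difference, back-propagated to adjust the preprocessing layer,
    the convolutional layer after it, and the regression layer."""
    position_diff = F.smooth_l1_loss(pred_boxes, gt_boxes)
    confidence_diff = F.binary_cross_entropy_with_logits(pred_conf, gt_conf)
    return position_diff + confidence_diff
```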
  • the preprocessing layer is used above as a general term of other layers in the convolutional neural network except for the regression layer and a convolutional layer before the regression layer.
  • the preprocessing layer may include at least one convolutional layer and at least one pooling layer.
  • the preprocessing layer may include parallel convolutional layers, and data output by the parallel convolutional layers may be spliced and input to a next layer.
  • the last layer in the preprocessing layer may be a convolutional layer or a pooling layer.
  • the preprocessing layer may include multiple pairs of convolutional layer and pooling layer connected in tandem.
  • the preprocessing layer may include additional rectifying layers.
  • a conventional convolutional neural network is generally used for classification, and the preprocessing layer in the convolutional neural network for classification is followed by a fully connected layer.
  • the fully connected layer may map the first feature output by the preprocessing layer to probability data corresponding to each preset type (or class). Therefore, a type to which an input image belongs may be determined by the regression layer.
  • the convolutional neural network is used for human head detection rather than classification.
  • the convolutional layer is configured to replace the fully connected layer, and to output the second feature for describing the sub-image features.
  • there may be a plurality of second features corresponding to each sub-image.
  • the human head position may be represented by a position of a rectangular box bounding a human head in the image.
  • the position of the rectangular box may be represented by a quadruple.
  • the quadruple may include a horizontal coordinate and a longitudinal coordinate of one vertex of the rectangular box and a width and a height of the rectangular box.
  • the quadruple may include a horizontal coordinate and a longitudinal coordinate of each of two diagonal vertexes of the rectangular box.
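  • For illustration, the two quadruple conventions can be converted into each other as follows (a small assumed helper, not part of this application):

```python
def xywh_to_corners(box):
    """Convert (x, y, width, height) of one vertex (top-left) into the
    (x1, y1, x2, y2) coordinates of two diagonal vertexes."""
    x, y, w, h = box
    return (x, y, x + w, y + h)

def corners_to_xywh(box):
    """Convert two diagonal vertexes back into (x, y, width, height)."""
    x1, y1, x2, y2 = box
    return (x1, y1, x2 - x1, y2 - y1)
```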
  • the confidence levels output by the regression layer are in a one-to-one correspondence with the human head positions output by the regression layer, each indicating a probability that a human head is actually present at the corresponding position in the image.
  • the regression layer may use a support vector machine (SVM).
  • step S 308 includes: mapping, through the convolutional layer in the regression layer in the convolutional neural network, the second feature corresponding to each sub-image to the human head position corresponding to each sub-image and the confidence level corresponding to the human head position.
  • the electronic device may directly map the second feature corresponding to each sub-image to the human head position corresponding to each sub-image and the confidence level corresponding to the human head position through the same convolutional layer in the regression layer in the convolutional neural network.
  • step S 308 includes: mapping, through a first convolutional layer in the regression layer in the convolutional neural network, the second feature corresponding to each sub-image to the human head position corresponding to each sub-image; and mapping, through a second convolutional layer in the regression layer in the convolutional neural network, the second feature corresponding to each sub-image to the confidence level corresponding to the output human head position.
  • the sub-image outputs 128 feature matrices (feature maps) each with a size M*N through the preprocessing layer in the convolutional neural network.
  • 128 is a preset value for the number of features, and can be set as needed.
  • M and N are determined by parameters of the preprocessing layer.
  • the 128 feature matrices with the size M*N are input to the convolutional layer after the preprocessing layer.
  • M*N feature vectors with a length 1024 are output.
  • the M*N feature vectors with the length 1024 are input to the first convolutional layer in the regression layer, and are convoluted by a parameter matrix with a size 1024*4 in the first convolutional layer, and M*N quadruples representing the human head position are output.
  • the M*N feature vectors with the length 1024 are input to the second convolutional layer in the regression layer, and are convoluted by a parameter vector with a size 1024*1 in the second convolutional layer, and M*N values indicating the confidence levels of the human head positions are output.
  • the correspondence between each human head position and its confidence level is embodied in the order of the output M*N quadruples and confidence values.
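  • One way to realize the dimensions in the walkthrough above (128 feature maps of size M*N, length-1024 feature vectors, a 1024*4 parameter matrix and a 1024*1 parameter vector) is with 1*1 convolutions, as in the PyTorch-style sketch below; this is an illustrative assumption, not the specific network claimed by this application:

```python
import torch
import torch.nn as nn

class RegressionHead(nn.Module):
    """Maps 128 feature maps of size M*N to M*N length-1024 feature vectors,
    then to M*N position quadruples and M*N confidence values."""
    def __init__(self):
        super().__init__()
        self.feature_conv = nn.Conv2d(128, 1024, kernel_size=1)   # converted "fully connected" layer
        self.position_conv = nn.Conv2d(1024, 4, kernel_size=1)    # first convolutional layer: 1024*4 parameters per location
        self.confidence_conv = nn.Conv2d(1024, 1, kernel_size=1)  # second convolutional layer: 1024*1 parameters per location

    def forward(self, first_features):                      # shape (B, 128, M, N)
        second_features = self.feature_conv(first_features)  # (B, 1024, M, N)
        head_positions = self.position_conv(second_features)  # (B, 4, M, N): M*N quadruples
        confidences = torch.sigmoid(self.confidence_conv(second_features))  # (B, 1, M, N)
        return head_positions, confidences
```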
  • the electronic device may compare the confidence level of each human head position output by the convolutional neural network with a confidence level threshold, and filter out human head positions of which confidence levels are less than the confidence level threshold.
  • the electronic device may further filter out, from the human head positions retained after thresholding by the confidence level threshold, those human head positions whose areas are smaller than a preset area.
  • the electronic device may cluster the filtered human head positions, and either combine the plurality of human head positions of the same type into one combined human head position in the image to be detected, or select one of the plurality of human head positions clustered into the same type as the human head position in the image to be detected.
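  • A minimal sketch of the confidence and area filtering described above, assuming (x, y, w, h) quadruples and illustrative threshold values (the clustering step is not shown):

```python
def filter_detections(boxes, confidences, conf_threshold=0.7, min_area=100):
    """Keep head positions whose confidence reaches the threshold and whose
    rectangular area is not smaller than a preset area."""
    kept = []
    for (x, y, w, h), conf in zip(boxes, confidences):
        if conf >= conf_threshold and w * h >= min_area:
            kept.append(((x, y, w, h), conf))
    return kept
```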
  • the convolutional neural network is trained in advance based on the training images having the marked human head position, and the convolutional neural network can automatically learn human head features.
  • the trained convolutional neural network can automatically extract appropriate features from the sub-images to output candidate human head positions and corresponding confidence levels, and then filter, according to the confidence levels, to acquire the human head position in the image to be detected.
  • the human head shape is learned rather than pre-assumed. As such, a missed detection caused by presuming the shape of the human head can be avoided, and the accuracy of the human head detection is improved.
  • the first features of the sub-images are output by the preprocessing layer including the convolutional layer and the pooling layer, and the second features are output by the convolutional layer after the preprocessing layer and before the regression layer to accurately describe human head features in the sub-images. The second features are then directly mapped to the human head positions and confidence levels by the regression layer, which is a new application of a convolutional neural network with a new structure. Compared with traditional circle detection, the accuracy of the human head detection is greatly improved.
  • the human head detection method further includes a step of converting and training the convolutional neural network for classification to a convolutional neural network for human head detection.
  • the step of converting and training the convolutional neural network for classification to a convolutional neural network for human head detection includes the following steps:
  • a conventional convolutional neural network for classification is a trained convolutional neural network which can classify images input to it, such as GoogLeNet, VGGNet, or AlexNet.
  • the convolutional neural network for classification includes the preprocessing layer, the fully connected layer, and the regression layer.
  • the fully connected layer is configured to output second features corresponding to each preset type (or class) of the conventional classification application.
  • the fully connected layer and the convolutional layer differ in sparse connectivity and weight sharing.
  • Each neuron of the fully connected layer is connected to all neurons of a preceding layer.
  • Both the convolutional layer and the fully connected layer acquire input of a next layer by multiplying output of the preceding layer by a parameter matrix.
  • the conventional fully connected layer can be converted to the convolutional layer by changing an arrangement manner of parameters of the fully connected layer.
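  • As a concrete illustration of rearranging fully connected parameters into convolutional parameters, a minimal PyTorch-style sketch follows; the framework and function name are assumptions, not part of this application:

```python
import torch.nn as nn

def fc_to_conv(fc: nn.Linear, in_channels: int, in_h: int, in_w: int) -> nn.Conv2d:
    """Reinterpret a fully connected layer that takes a flattened
    (in_channels, in_h, in_w) input as an equivalent convolutional layer,
    changing only the arrangement of its parameters."""
    out_features = fc.out_features
    conv = nn.Conv2d(in_channels, out_features, kernel_size=(in_h, in_w))
    conv.weight.data = fc.weight.data.view(out_features, in_channels, in_h, in_w)
    conv.bias.data = fc.bias.data.clone()
    return conv
```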
  • the regression layer is configured to map the second features of each preset type output by the fully connected layer to a probability corresponding to each preset type, and determine, according to the mapped probability, a preset type to which the image belongs. For example, a preset type corresponding to a maximum probability is selected as the preset type to which the input image belongs.
  • the regression layer is configured to map a preset number of second features output by the converted convolutional layer to the human head positions and the confidence levels corresponding to the human head positions.
  • the regression layer may use a convolutional layer.
  • the convolutional layer directly maps the second features to the human head positions and the confidence levels corresponding to the human head positions.
  • the regression layer may also use two convolutional layers in parallel. One convolutional layer is configured to map the second features to the human head positions, and the other convolutional layer is configured to map the second features to the confidence levels corresponding to the mapped human head positions.
  • the convolutional neural network including the preprocessing layer, the converted convolutional layer and the replaced regression layer is reconstructed and modified from the conventional convolutional neural network for classification applications.
  • parameters of the preprocessing layer may be pre-trained.
  • the training process may be a joint process. For example, the entire network may be trained.
  • the preprocessing layer parameters may be initialized to their pre-trained values and retrained together with the rest of the network.
  • the confidence level may be pre-assigned to the marked human head positions of the training image.
  • the training image is segmented into one or more sub-images according to the same segmenting manner as that of the image to be detected.
  • the segmented sub-images are respectively input to the convolutional neural network, and the human head positions and the confidence levels are output by the preprocessing layer, the convolutional layer after the preprocessing layer, and the regression layer of the convolutional neural network.
  • the difference between the output human head positions and the marked human head position is calculated, and the difference between the corresponding confidence levels is calculated, and the parameters in the preprocessing layer, the convolutional layer after the preprocessing layer, and the regression layer in the convolutional neural network are adjusted according to the two differences.
  • the training is continued until a termination condition is reached.
  • the termination condition may be that the difference is less than a preset difference, or the number of iterations reaches a preset number of times.
  • the training is performed after reconstruction of the conventional convolutional neural network for classification into the convolutional neural network for human head detection.
  • because the reconstruction does not require a complete redesign of the neural network, the training duration can be reduced and the efficiency of human head detection is improved.
  • step S 310 specifically includes the following steps:
  • S 602 Screen, from the human head positions corresponding to the sub-images, to acquire a human head position corresponding to a confidence level greater than or equal to a confidence level threshold.
  • the electronic device may form the human head positions respectively corresponding to the sub-images segmented from the image to be detected into a human head position set, traverse the human head position set, and compare the confidence levels of the traversed human head positions with the confidence level threshold.
  • the human head positions having confidence levels lower than the confidence level threshold may be removed from the human head position set.
  • the remaining human head positions in the human head position set after the traversing are the acquired human head positions of which the corresponding confidence levels are greater than or equal to the confidence level threshold.
  • the confidence level threshold may be set as needed, for example, may be valued from 0.5 to 0.99.
  • S 604 Select human head positions that intersect, in the image to be detected, with the human head positions acquired in S 602.
  • the intersection of the human head positions means that enclosed areas indicated by respective human head positions have an intersection in the image to be detected.
  • the human head position is represented by a position of a rectangular box including the human head image
  • the intersection of the human head positions is the intersection of the corresponding rectangular boxes.
  • the electronic device may select a human head position intersecting with the acquired human head position in the image to be detected from the human head position set formed by the human head positions respectively corresponding to all the sub-images segmented from the image to be detected.
  • the electronic device may also seek for the intersecting human head positions from only the acquired human head positions.
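  • For reference, the intersection test between two rectangular boxes can be written as follows (the box format and function name are illustrative assumptions):

```python
def rectangles_intersect(box_a, box_b):
    """Return True if two (x, y, w, h) rectangular boxes have an overlapping
    area in the image to be detected."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah
```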
  • S 606 Determine, according to the acquired human head position and the identified human head position, the human head position detected in the image to be detected.
  • the electronic device may classify the acquired human head positions and the selected human head positions.
  • Each type includes at least one of the acquired human head positions, and also includes human head positions intersecting with the at least one human head position.
  • the electronic device may combine the human head positions of each type to one human head position as a detected head position, or select one human head position from the human head positions of each type as the detected human head position.
  • the accuracy of human head detection can be further improved by using the confidence levels and the position intersection as the basis for determining the human head position in the image to be detected.
  • step S 606 specifically includes the following steps:
  • S 702 Use the acquired human head position (from step 602 ) and the selected human head position (from step 604 ) as nodes in a bipartite graph, as a first group and second group, respectively.
  • a bipartite graph is a graph in graph theory whose nodes may be divided into two groups such that every edge connects a node in one group to a node in the other group.
  • a default, positive weight, such as 1000, is assigned to each edge between the nodes in the bipartite graph, and the assigned weight of an edge is reduced when the human head positions indicated by the nodes associated with that edge intersect.
  • an edge combination in the bipartite graph is a set of edges having no common nodes. The edge combination whose sum of edge weights is the largest among all edge combinations of the bipartite graph is referred to as the maximum weight edge combination.
  • the electronic device may traverse all edge combinations in the bipartite graph to find the maximum weight edge combination.
  • the electronic device may also use a Kuhn-Munkres algorithm to solve the maximum weight edge combination of the bipartite graph. After the maximum weight edge combination is solved, the human head positions associated with the edges in the maximum weight edge combination can be used as the human head position detected in the image to be detected.
  • the intersecting human head positions may correspond to the same human head
  • the human head positions output by the convolutional neural network are mostly gathered near the actual human head position in the image to be detected. Therefore, the acquired human head positions (from step 602, for example) and the selected human head positions (from step 604, for example) are used as the nodes in the bipartite graph to construct the bipartite graph, and the weights of the edges corresponding to intersecting human head positions are reduced.
  • by solving the maximum weight edge combination, the human head positions detected in the image to be detected are acquired, and the human head detection can be performed more accurately.
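  • The maximum weight edge combination can be computed, for example, with NetworkX's max_weight_matching (the Kuhn-Munkres algorithm mentioned above is another option). The sketch below is an assumed illustration, reusing the rectangles_intersect helper sketched earlier; the default weight of 1000 follows the example above, while the reduction applied to intersecting edges and all names are assumptions:

```python
import networkx as nx

def detect_heads_via_bipartite(acquired, selected, rectangles_intersect):
    """acquired: head positions passing the confidence threshold (step S 602).
    selected: head positions intersecting them (step S 604).
    Returns the head positions associated with edges of a maximum weight
    edge combination (matching) of the bipartite graph."""
    g = nx.Graph()
    for i, box_a in enumerate(acquired):
        for j, box_b in enumerate(selected):
            weight = 1000.0                      # default positive weight
            if rectangles_intersect(box_a, box_b):
                weight -= 500.0                  # reduce weight for intersecting positions (illustrative amount)
            g.add_edge(("a", i), ("s", j), weight=weight)
    matching = nx.max_weight_matching(g)         # maximum weight edge combination
    detected = set()
    for u, v in matching:
        for group, idx in (u, v):
            detected.add(acquired[idx] if group == "a" else selected[idx])
    return list(detected)
```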
  • the image to be detected may be a video frame in a video
  • the human head detection method further includes a step of performing human head tracking and performing people counting frame by frame.
  • the step of performing human head tracking and performing people counting frame by frame specifically includes the following steps:
  • S 802 Perform human head tracking video frame by video frame according to the human head position detected in the image to be detected.
  • the electronic device, after detecting the human head position in one video frame, performs human head tracking video frame by video frame by using the detected human head position as a starting point.
  • the electronic device may specifically use a mean shift tracking algorithm, an optical flow tracking algorithm, or a tracking-learning-detection (TLD) algorithm.
  • S 804 Determine a moving direction and a positional relationship of the tracked human head position relative to a designated area.
  • the designated area refers to the area designated in the video frame.
  • the moving direction of the tracked human head position relative to the designated area refers to that the human head position is, for example, moving toward or away from the designated area.
  • the positional relationship of the tracked human head position relative to the designated area refers to that the human head position is inside or outside the designated area.
  • when the tracked human head position crosses a line representing a boundary of the designated area in a direction toward the designated area, it is determined that the tracked human head position enters the designated area.
  • when the tracked human head position crosses the line representing the boundary of the designated area in a direction away from the designated area, it is determined that the tracked human head position leaves the designated area.
  • alternatively, when the tracked human head position sequentially crosses a first line and a second line parallel with the first line, it is determined that the tracked human head position enters the designated area. When the tracked human head position sequentially crosses the second line and the first line, it is determined that the tracked human head position leaves the designated area.
  • the parallel first line and second line may be straight lines or curved lines.
  • the designated area may be one of two areas formed by segmenting the image to be detected by the second line, without including the first line.
  • the moving direction and the positional relationship of the tracked human head position relative to the designated area are determined by the two lines, thereby preventing a judgment error caused by movement of the human head position in the vicinity of the boundary of the designated area, thereby ensuring the correctness of people counting.
  • the people counting may be specifically counting a combination of one or more of the number of accumulated people entering the designated area, the number of accumulated people leaving the designated area, and the dynamic number of people entering the designated area. Specifically, the electronic device may add 1 to the number of statistically accumulated people entering the designated area, and/or add 1 to the number of dynamic people entering the designated area when one tracked human head position enters the designated area. The electronic device may add 1 to the number of statistically accumulated people leaving the designated area, and/or subtract 1 from the number of dynamic people entering the designated area when one tracked human head position leaves the designated area
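  • A simplified per-track sketch of the two-line counting logic described above, assuming horizontal lines at fixed vertical coordinates; the class, attribute names and exact crossing test are illustrative assumptions:

```python
class TwoLineCounter:
    """People counting with two parallel lines: a tracked head that crosses
    the first line and then the second line is counted as entering the
    designated area; crossing them in the reverse order counts as leaving."""

    def __init__(self, first_line_y, second_line_y):
        self.first_y = first_line_y
        self.second_y = second_line_y
        self.entered = 0          # accumulated number of people entering
        self.left = 0             # accumulated number of people leaving
        self.inside = 0           # dynamic number of people in the area
        self._last_line = {}      # track id -> last line crossed ("first" or "second")

    def update(self, track_id, prev_y, curr_y):
        """prev_y / curr_y: vertical center of one tracked head position in
        the previous and current video frame."""
        crossed_first = min(prev_y, curr_y) < self.first_y <= max(prev_y, curr_y)
        crossed_second = min(prev_y, curr_y) < self.second_y <= max(prev_y, curr_y)
        if crossed_first and not crossed_second:
            if self._last_line.get(track_id) == "second":
                self.left += 1                        # sequence: second line, then first line
                self.inside = max(self.inside - 1, 0)
            self._last_line[track_id] = "first"
        elif crossed_second and not crossed_first:
            if self._last_line.get(track_id) == "first":
                self.entered += 1                     # sequence: first line, then second line
                self.inside += 1
            self._last_line[track_id] = "second"
```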
  • the human head detection may be applied to security applications.
  • the people counting is performed according to the moving direction and the positional relationship of the tracked human head position relative to the designated area. Based on accurate human head detection, the accuracy of people counting can be improved.
  • the human head detection method further includes a step of detecting the human head position and continuing tracking near the human head position tracked in a previous video frame when the tracking of the human head position is interrupted.
  • the step specifically includes the following steps:
  • the electronic device tracks the detected human head position with the detected human head position in the image to be detected as a starting point, and records the tracked human head position.
  • the tracking of the human head position may be interrupted, and in this case, the human head position tracked in the previous video frame and recorded during the tracking video frame by video frame is acquired.
  • S 906 Detect human head positions in a local area covering the acquired human head position (in step 904 ) in the current video frame.
  • the local area covering the acquired human head position is smaller than a size of one video frame, and larger than a size of the area occupied by the human head position tracked in the previous video frame.
  • a shape of the local area may be similar to a shape of the area occupied by the human head position tracked in the previous video frame.
  • a center of the local area may overlap with a center of the area occupied by the human head position tracked in the previous video frame.
  • the electronic device may detect the human head positions in the current video frame to find the human head positions belonging to the local area.
  • the electronic device may also detect the human head positions only in the local area.
  • the electronic device may specifically use steps S 302 to S 310 to detect the human head positions in the local area in the current video frame.
  • the detected human head positions may be partially or entirely located in the local area.
  • the electronic device may use the human head positions of which the centers are within the local area as the human head positions in the detected local area, and the human head positions of which the centers are outside the local area do not belong to the human head positions in the local area.
  • the human head position is represented by a position of a rectangular box including the human head image
  • a width of the rectangular box tracked in the previous video frame is W and a height is H
  • a and b are set to coefficients greater than 1
  • the local area may be the rectangular area having a width of a*W and a height of b*H and the same center as the rectangular box.
  • when the center coordinates of the rectangular box tracked in the previous video frame are (X1, Y1) and the center coordinates of another rectangular box indicating a detected human head position are (X2, Y2), the detected rectangular box may be regarded as lying within the local area when |X1-X2| <= a*W/2 and |Y1-Y2| <= b*H/2.
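  • A minimal sketch of the local-area test just described; the values of the coefficients a and b (any values greater than 1 would do) and the function name are illustrative assumptions:

```python
def center_in_local_area(prev_box, candidate_box, a=1.5, b=1.5):
    """prev_box: (x, y, W, H) rectangle tracked in the previous frame.
    candidate_box: a head position detected in the current frame.
    Returns True if the candidate's center falls inside the local area of
    width a*W and height b*H centered on the previous rectangle."""
    px, py, w, h = prev_box
    cx1, cy1 = px + w / 2, py + h / 2          # center (X1, Y1)
    qx, qy, qw, qh = candidate_box
    cx2, cy2 = qx + qw / 2, qy + qh / 2        # center (X2, Y2)
    return abs(cx1 - cx2) <= a * w / 2 and abs(cy1 - cy2) <= b * h / 2
```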
  • S 908 Continue to perform step S 902 from the human head position detected in the local area.
  • when the tracking of the human head positions is interrupted, the human head positions can be detected from the vicinity of the human head positions tracked in the previous frame, and the interrupted human head tracking can be recovered from the interruption and continued.
  • the human head detection and the human head tracking are combined to ensure the continuity of the tracking. Further, the accuracy of people counting is ensured.
  • a large number of top view images at an elevator entrance scene are acquired in advance, and the human head positions in these top view images are marked or labeled.
  • a quadruple is used to indicate the position of the human head image in a rectangular box 1001 in FIG. 10 .
  • a convolutional neural network for classification is selected, the fully connected layer after the preprocessing layer and before the regression layer is converted to a convolutional layer, and the regression layer therein is replaced with the regression layer configured to map the second feature output by the converted convolutional layer to the human head position and the corresponding confidence level, thereby retraining the convolutional neural network by using the marked top view image.
  • a top view camera is disposed above a gate, and the videos are captured by the top view camera and transmitted to an electronic device connected to the top view camera.
  • the electronic device uses an image area sandwiched by a line 1101 and a line 1104 in one of the video frames as an image to be detected, and segments the image to be detected into one or more sub-images.
  • Each sub-image is input to a convolutional neural network trained by training images having marked human head positions.
  • the convolutional neural network outputs the human head positions corresponding to each sub-image and the confidence level corresponding to the human head positions, thereby filtering, according to the corresponding confidence level, the human head positions corresponding to each sub-image, and acquiring the human head positions detected in the image to be detected.
  • the electronic device performs human head tracking video frame by video frame according to the human head position detected in the image to be detected, and it is determined that a tracked human head position 1105 enters a designated area when the tracked human head position 1105 sequentially crosses a first line 1102 and a second line 1103 parallel with the first line 1102 .
  • a tracked human head position 1106 sequentially crosses the second line 1103 and the first line 1102 , it is determined that the tracked human head position 1106 leaves the designated area.
  • the designated area in FIG. 11 may be specifically the area sandwiched by the second line 1103 and a line 1104 .
  • an electronic device is further provided, and an internal structure of the electronic device may be shown in FIG. 2 .
  • the electronic device includes a human head detection apparatus.
  • the human head detection apparatus includes various modules, and the modules may be all or partially implemented by software, hardware or a combination thereof.
  • FIG. 12 is a structural block diagram of a human head detection apparatus 1200 according to an embodiment.
  • the human head detection apparatus 1200 includes a segmenting module 1210 , a convolutional neural network module 1220 , and a human head detection result determining module 1230 .
  • the segmenting module 1210 is configured to segment an image to be detected into one or more sub-images.
  • the convolutional neural network module 1220 is configured to input each sub-image to a convolutional neural network trained according to training images having marked human head positions, and output, by a preprocessing layer including at least one convolutional layer and at least one pooling layer in the convolutional neural network, a first feature corresponding to each sub-image; map, through the convolutional layer after the preprocessing layer in the convolutional neural network, the first feature corresponding to each sub-image to a second feature corresponding to each sub-image; and map, through a regression layer of the convolutional neural network, the second feature corresponding to each sub-image to a human head position corresponding to each sub-image and a corresponding confidence level of the human head position.
  • the human head detection result determining module 1230 is configured to filter, according to the corresponding confidence level, the human head position corresponding to each sub-image, to acquire a human head position detected in the image to be detected.
  • the convolutional neural network is trained in advance based on the training image having the marked human head position, and the convolutional neural network can automatically learn human head features.
  • the trained convolutional neural network can automatically extract appropriate features from the sub-images to output candidate human head positions and corresponding confidence levels, and then filter, according to the confidence levels, to acquire the human head position in the image to be detected.
  • because the human head shape is not required to be assumed in advance, a missed detection caused by setting the human head shape can be avoided, and the accuracy of the human head detection is improved.
  • the first features of the sub-images are output by the preprocessing layer including the convolutional layer and the pooling layer, and the second features are output by the convolutional layer after the preprocessing layer and before the regression layer to accurately describe human head features in the sub-images. The second features are then directly mapped to the human head positions and confidence levels by the regression layer, which is a new application of a convolutional neural network with a new structure. Compared with traditional circle detection, the accuracy of the human head detection is greatly improved.
  • the segmenting module 1210 is further configured to segment the image to be detected into one or more sub-images of a fixed size, and adjacent sub-images in the segmented sub-images have an overlapping part. In this embodiment, there is an overlapping part between the adjacent segmented sub-images, thereby ensuring that the adjacent sub-images have stronger correlation, and improving accuracy of detecting a human head position from the image to be detected.
  • the human head detection apparatus 1200 further includes a convolutional neural network adjusting module 1240 and a training module 1250 .
  • the convolutional neural network adjusting module 1240 is configured to convert a fully connected layer after the preprocessing layer and before the regression layer included in the convolutional neural network for classification to a convolutional layer; and replace a regression layer in the convolutional neural network for classification with a regression layer configured to map the second feature output by the converted convolutional layer to the human head position and the corresponding confidence level.
  • the training module 1250 is configured to train the convolutional neural network including the preprocessing layer, the converted convolutional layer and the replaced regression layer by using the training image having the marked human head position.
  • the training after reconstruction is performed based on the convolutional neural network for classification, to acquire the convolutional neural network for human head detection.
  • because a complete redesign of the convolutional neural network is not required, the training duration can be reduced and the efficiency of human head detection is improved.
  • the convolutional neural network module 1220 is further configured to map, through a first convolutional layer in the regression layer in the convolutional neural network, the second feature corresponding to each sub-image to a human head position corresponding to each sub-image; and map, through a second convolutional layer in the regression layer in the convolutional neural network, the second feature corresponding to each sub-image to a confidence level corresponding to the output human head position.
  • the human head detection result determining module 1230 includes a filtering module 1231 and a head position determining module 1232 .
  • the filtering module 1231 is configured to screen, from the human head positions corresponding to the sub-images, to acquire a human head position corresponding to a confidence level greater than or equal to a confidence level threshold; and select a human head position intersecting with the acquired human head position in the image to be detected from the human head positions corresponding to the sub-images.
  • the human head position determining module 1232 is configured to determine, according to the acquired human head position and the selected human head position, the human head position detected in the image to be detected.
  • the accuracy of the human head detection can be further improved by using the confidence levels and whether the human head positions intersect as the basis for determining the human head position in the image to be detected.
  • the human head position determining module 1232 is further configured to use the acquired human head position and the selected human head position as nodes in a bipartite graph; assign default and positive weights to edges between the nodes in the bipartite graph; reduce the corresponding assigned weights when the human head positions indicated by the nodes associated with the edges intersect; and solve a maximum weight edge combination of the bipartite graph, to acquire the head position detected in the image to be detected.
  • the human head positions output by the convolutional neural network are mostly gathered near the actual human head position in the image to be detected. Therefore, the acquired human head positions and the selected human head positions are used as nodes in the bipartite graph to construct the bipartite graph, and the weights of the edges corresponding to intersecting human head positions are relatively small. By solving the maximum weight edge combination, the human head position detected in the image to be detected is acquired, and the human head detection can be performed more accurately.
  • the image to be detected is a video frame in a video.
  • the human head detection apparatus 1200 further includes:
  • a tracking module 1260 configured to perform head tracking video frame by video frame according to the human head position detected in the image to be detected;
  • a counting condition detecting module 1270 configured to determine a moving direction and a positional relationship of the tracked human head position relative to the designated area; and
  • a people counting module 1280 configured to perform people counting based on the determined moving direction and positional relationship.
  • the human head detection is applied to the field of security.
  • the people counting is performed according to the moving direction and the positional relationship of the tracked human head position relative to the designated area. Based on accurate human head detection, the accuracy of people counting can be ensured.
  • the counting condition detecting module 1270 is further configured to: determine that the tracked human head position enters the designated area when the tracked human head position sequentially spans a first line and a second line parallel with the first line; and determine that the tracked human head position leaves the designated area when the tracked human head position sequentially spans the second line and the first line.
  • the moving direction and the positional relationship of the tracked human head position relative to the designated area are determined by two lines, thereby preventing a judgment error caused by movement of the human head position near a boundary of the designated area and ensuring the correctness of people counting.
  • the human head detection apparatus 1200 further includes a human head position acquiring module 1290 .
  • the tracking module 1260 is further configured to track and record the human head position video frame by video frame.
  • the human head position acquiring module 1290 is configured to acquire a human head position tracked in a previous recorded video frame if the tracking of the human head position in a current video frame is interrupted.
  • the convolutional neural network module 1220 is further configured to detect human head positions in a local area covering the acquired head position in the current video frame.
  • the tracking module 1260 is further configured to continue to perform the step of tracking and recording the human head position video frame by video frame from the human head positions detected in the local area.
  • when the tracking of the human head positions is interrupted, the human head positions can be detected in the vicinity of the human head positions tracked in the previous frame, and the interrupted human head tracking can be continued.
  • the human head detection and the human head tracking are combined to ensure the continuity of the tracking. Further, the accuracy of people counting is ensured.
  • steps in various embodiments of this application are not necessarily performed in an order indicated by the step numbers. Unless explicitly described in this specification, there is no strict sequence for execution of the steps.
  • at least some steps in the embodiments may include a plurality of substeps or a plurality of stages.
  • the substeps or the stages are not necessarily performed at a same moment, and instead may be performed at different moments.
  • the substeps or the stages are not necessarily performed in sequence, and instead may be performed in turn or alternately with another step or with at least some of the substeps or stages of the another step.
  • the non-volatile memory may include: a read-only memory (ROM), a programmable ROM (PROM), an electrically programmable ROM (EPROM), an electrically erasable programmable ROM (EEPROM), or a flash memory.
  • the volatile memory may include a random access memory (RAM) or an external cache memory.
  • the RAM may be implemented in various forms, such as a static RAM (SRAM), a dynamic RAM (DRAM), a synchronous DRAM (SDRAM), a double data rate SDRAM (DDRSDRAM), an enhanced SDRAM (ESDRAM), a synchlink DRAM (SLDRAM), a rambus direct RAM (RDRAM), a direct rambus dynamic RAM (DRDRAM), and a rambus dynamic RAM (RDRAM).


Abstract

A method for detecting and tracking a human head in an image by an electronic device is disclosed. The method may include segmenting the image into one or more sub-images; inputting each sub-image to a convolutional neural network trained according to training images having marked human head positions; outputting, by a preprocessing layer of the convolutional neural network comprising a first convolutional layer and a pooling layer, a first feature corresponding to each sub-image; mapping through a second convolutional layer the first feature corresponding to each sub-image to a second feature corresponding to each sub-image; mapping through a regression layer the second feature corresponding to each sub-image to a human head position corresponding to each sub-image and a corresponding confidence level of the human head position; and filtering, according to the corresponding confidence level, human head positions corresponding to the one or more sub-images, to acquire detected human head positions in the image.

Description

    RELATED APPLICATION
  • This application is a continuation application of the International PCT Application No. PCT/CN2018/070008, filed with the Chinese Patent Office on Jan. 2, 2018 and claims priority to Chinese Patent Application No. 2017100292446, filed with the Chinese Patent Office on Jan. 16, 2017 and entitled “HUMAN HEAD DETECTION METHOD AND APPARATUS”, which is incorporated herein by reference in its entirety.
  • FIELD OF THE TECHNOLOGY
  • This application relates to the technical field of image processing, and in particular, to a method, an electronic device and a storage medium for human head detection.
  • BACKGROUND OF THE DISCLOSURE
  • Human head detection refers to the detection of the head of a human body in an image, and the result of human head detection has various applications, such as applications in the field of security. At present, human head detection is implemented mainly based on the shape and color of a human head. A typical process of such human head detection includes: first, binarizing image pixels and performing edge detection to acquire a substantially circular edge; then using circle detection to acquire a position and size of the circular edge; and then performing gray scale and size determination on the corresponding circular area in the original image to obtain a human head detection result.
  • However, the current human head detection relies on an assumption that the shape of the human head is circular. In fact, the shape of the human head is not strictly circular, and the head shapes of different persons also differ. As a result, some human heads are missed during detection, and the accuracy of the human head detection result is relatively low.
  • SUMMARY
  • According to various embodiments provided by this disclosure, methods, electronic devices and storage media are provided for implementing human head detection in images.
  • A human head detection method includes:
  • segmenting, by an electronic device, an image to be detected into one or more sub-images;
  • inputting, by the electronic device, each sub-image to a convolutional neural network trained according to a training image having a marked human head position respectively, and outputting, by a preprocessing layer including a convolutional layer and a pooling layer in the convolutional neural network, a first feature corresponding to each sub-image;
  • mapping, by the electronic device through a convolutional layer after the preprocessing layer in the convolutional neural network, the first feature corresponding to each sub-image to a second feature corresponding to each sub-image;
  • mapping, by the electronic device through a regression layer in the convolutional neural network, the second feature corresponding to each sub-image to a human head position corresponding to each sub-image and a corresponding confidence level of the human head position; and
  • filtering, by the electronic device according to the corresponding confidence level, the human head position corresponding to each sub-image, to acquire a human head position detected in the image to be detected.
  • An electronic device includes a memory and a processor, the memory storing a computer readable instruction, the computer readable instruction, when executed by the processor, causing the processor to perform the following steps:
  • segmenting an image to be detected into one or more sub-images;
  • inputting each sub-image to a convolutional neural network trained according to a training image having a marked human head position respectively, and outputting, by a preprocessing layer including a convolutional layer and a pooling layer in the convolutional neural network, a first feature corresponding to each sub-image;
  • mapping, through a convolutional layer after the preprocessing layer in the convolutional neural network, the first feature corresponding to each sub-image to a second feature corresponding to each sub-image;
  • mapping, through a regression layer of the convolutional neural network, the second feature corresponding to each sub-image to a human head position corresponding to each sub-image and a corresponding confidence level of the human head position; and
  • filtering, according to the corresponding confidence level, the human head position corresponding to each sub-image, to acquire the human head position detected in the image to be detected.
  • One or more non-volatile storage media storing a computer readable instruction are provided, the computer readable instruction, when executed by one or more processors, causing the one or more processors to perform the following steps:
  • segmenting an image to be detected into one or more sub-images;
  • inputting each sub-image to a convolutional neural network trained according to a training image having a marked human head position respectively, and outputting, by a preprocessing layer including a convolutional layer and a pooling layer in the convolutional neural network, a first feature corresponding to each sub-image;
  • mapping, through a convolutional layer after the preprocessing layer in the convolutional neural network, the first feature corresponding to each sub-image to a second feature corresponding to each sub-image;
  • mapping, through a regression layer in the convolutional neural network, the second feature corresponding to each sub-image to a human head position corresponding to each sub-image and a corresponding confidence level of the human head position; and
  • filtering, according to the corresponding confidence level, the human head position corresponding to each sub-image, to acquire the human head position detected in the image to be detected.
  • Details of one or more embodiments of this application are put forward in the following accompanying drawings and descriptions. Other features, objectives, and advantages of this application become more obvious with reference to the specification, the accompanying drawings, and the claims.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • To describe the technical solutions of the embodiments of this application more clearly, the following briefly introduces the accompanying drawings required for describing the embodiments. The accompanying drawings described below are only some embodiments of this application, and a person of ordinary skill in the art can obtain other accompanying drawings according to these accompanying drawings without creative efforts.
  • FIG. 1 shows an application environment diagram of a human head detection method according to an embodiment.
  • FIG. 2 shows a schematic diagram of an internal structure of an electronic device according to an embodiment.
  • FIG. 3 shows a schematic flowchart of a human head detection method according to an embodiment.
  • FIG. 4 shows a schematic structural diagram of a convolutional neural network according to an embodiment.
  • FIG. 5 shows a schematic flowchart for converting a convolutional neural network for image classification to a convolutional neural network for human head detection.
  • FIG. 6 is a schematic flowchart for filtering human head positions according to confidence levels.
  • FIG. 7 is a schematic flowchart for implementing step 606 of FIG. 6.
  • FIG. 8 is a schematic flowchart of performing human head tracking and people counting in a video frame by frame.
  • FIG. 9 is a schematic flowchart for detecting a human head position in a current video image frame near a human head position tracked in a previous video frame and continuing to track the human head when the tracking of a human head position is interrupted at the previous video frame.
  • FIG. 10 illustrates an application scenario for human head detection and tracking.
  • FIG. 11 is a schematic diagram of performing people counting by using two parallel lines according to an embodiment.
  • FIG. 12 is a structural block diagram of a human head detection apparatus according to an embodiment.
  • FIG. 13 is a structural block diagram of a human head detection apparatus according to another embodiment.
  • FIG. 14 is a structural block diagram of a human head detection result determining module according to an embodiment.
  • FIG. 15 is a structural block diagram of a human head detection apparatus according to still another embodiment.
  • FIG. 16 is a structural block diagram of a human head detection apparatus according to yet another embodiment.
  • DESCRIPTION OF EMBODIMENTS
  • To make the objectives, technical solutions, and advantages of this application clearer, the following disclosure further describes this application in detail with reference to the accompanying drawings and embodiments. It should be understood that specific embodiments described herein are merely intended to explain this application and are not intended to limit this application.
  • While the disclosure herein specifically refers to human head detection in top view images, the underlying principle may be applied to detection of other objects in any type of images. For example, the systems and methods disclosed below may be applied to detection of motor vehicles in satellite images for monitoring traffic and the like.
  • FIG. 1 is an application environment diagram of a human head detection method according to an embodiment. Referring to FIG. 1, the human head detection method is applied to a human head detection system, which includes an electronic device 110 and a top view camera 120 connected to the electronic device 110. The top view camera 120 is configured to capture an image to be detected and send the image to be detected to the electronic device 110. The top view camera may be mounted on the top of a building, on a wall above the height of a person (or a predetermined height), or at a corner of the top of the building, so that the top view camera can capture images from a top view angle. The top view may be an orthographic top view or a top view of an oblique angle (alternatively referred to as a perspective top view).
  • In an embodiment, the electronic device 110 may be configured to segment an image to be detected into one or more sub-images; input each sub-image to a convolutional neural network trained according to training images having marked human head positions (or labeled with human heads), and output, by a preprocessing layer including at least one convolutional layer and at least one pooling layer in the convolutional neural network, a first feature corresponding to each sub-image; map, through at least one other convolutional layer after the preprocessing layer in the convolutional neural network, the first feature corresponding to each sub-image to a second feature corresponding to each sub-image; map, through at least one regression layer of the convolutional neural network, the second feature corresponding to each sub-image to a human head position corresponding to each sub-image and a corresponding confidence level of the human head position; and filter, according to the corresponding confidence level, the human head position corresponding to each sub-image, to acquire human head positions detected in the image to be detected.
  • FIG. 2 is a schematic diagram of an internal structure of an electronic device according to an embodiment. Referring to FIG. 2, the electronic device includes a processor, a memory and a network interface which are connected by a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the electronic device may store an operating system and computer readable instructions. When being executed, the computer readable instruction may cause the processor to perform a human head detection method. The processor of the electronic device may include a central processing unit and a graphics processing unit. The processor is configured to provide computing and control capabilities to support operation of the electronic device. The internal memory may store the computer readable instruction. When being executed by the processor, the computer readable instruction may cause the processor to perform a human head detection method. The network interface of the electronic device is configured to be connected to the top view camera. The electronic device may be implemented by an integrated electronic device or a cluster including multiple electronic devices. The electronic device may be a personal computer, a server or a dedicated human head detection device. Those having ordinary skills in the art can understand that the structure shown in FIG. 2 is only a block diagram of a part of the structure related to the solution of this application, and does not constitute limitation on the electronic device to which the solution of this application is applied. The specific electronic device may include more or fewer components than those shown in the figure, or may combine some components, or have a different component arrangement.
  • FIG. 3 is a schematic flowchart of a human head detection method according to an embodiment. This embodiment is mainly illustrated by applying the method to the electronic device 110 in above FIG. 1 and FIG. 2. Referring to FIG. 3, the human head detection method specifically includes the following steps:
  • S302: Segment an image to be detected into one or more sub-images.
  • The image to be detected is an image on which human head detection needs to be performed. The image to be detected may be a picture or a video frame in a video. The sub-images are images which are segmented from the image to be detected and have a size smaller than the image to be detected. All segmented sub-images may have the same size or different sizes.
  • Specifically, the electronic device may traverse a window of a fixed size over the image to be detected according to a transverse step size and a longitudinal step size, thereby segmenting sub-images having the same size as the window from the image to be detected during the traversal. The segmented sub-images may be combined to cover the image to be detected.
  • In an embodiment, step S302 includes: segmenting the image to be detected to one or more sub-images of a fixed size, adjacent sub-images in the segmented sub-images having an overlapping part.
  • The adjacent sub-images refer to that positions of the sub-images in the image to be detected are adjacent, and the adjacent sub-images may partially overlap. Specifically, the electronic device may traverse the window of a fixed size in the image to be detected according to the transverse step size smaller than a window width and the longitudinal step size smaller than a window height, to acquire one or more sub-images of the same size, and adjacent sub-images have an overlapping part.
  • In this embodiment, there is an overlapping part between the segmented adjacent sub-images, thereby ensuring that the adjacent sub-images have higher correlation, and improving the accuracy of detecting a human head position from the image to be detected, particularly when a human head lies at a boundary between adjacent sub-images.
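  • As a non-limiting illustration of the sliding-window segmentation described above, the following Python sketch traverses a fixed-size window over an image array with transverse and longitudinal step sizes smaller than the window size, so that adjacent sub-images overlap. The function name, the NumPy array input and the example window and step values are illustrative assumptions, not details from this disclosure.

        import numpy as np

        def segment_into_sub_images(image, window_h, window_w, step_y, step_x):
            # image: an H x W (x C) array of the image to be detected
            h, w = image.shape[:2]
            sub_images = []
            for top in range(0, max(h - window_h, 0) + 1, step_y):
                for left in range(0, max(w - window_w, 0) + 1, step_x):
                    sub = image[top:top + window_h, left:left + window_w]
                    # keep the window offset so head positions found in the
                    # sub-image can be mapped back to full-image coordinates
                    sub_images.append(((top, left), sub))
            return sub_images

        # step sizes smaller than the window size give overlapping sub-images
        tiles = segment_into_sub_images(np.zeros((480, 640)), 224, 224, 112, 112)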
  • S304: Input each sub-image to a convolutional neural network trained according to a set of training images having marked human head positions, and output, by a preprocessing layer including at least one convolutional layer and at least one pooling layer in the convolutional neural network, a first feature corresponding to each sub-image.
  • The Convolutional Neural Network (CNN) is an artificial neural network. The convolutional neural network includes a preprocessing layer having at least one convolutional layer and at least one pooling layer. The convolutional neural network used in this embodiment may be directly constructed, or may alternatively be acquired by reconstructing an existing convolutional neural network. A computational task in the convolutional neural network may be implemented by a central processing unit or a graphics processing unit. The time consumed by the central processing unit for human head detection is approximately on the order of seconds, and the time consumed by the graphics processing unit for human head detection may be reduced to the order of hundreds of milliseconds, thereby enabling real-time human head detection.
  • In a convolutional layer of the convolutional neural network, there are a plurality of feature maps, each feature map includes a plurality of neurons, and all neurons of the same feature map share one convolution kernel. The convolution kernel provides the weights of the corresponding neurons and represents a feature. The convolution kernel is generally initialized as a random decimal matrix, and a proper convolution kernel is learned during training of the network to represent a feature. The convolutional layer reduces the number of connections between layers in the neural network and, in addition, reduces the risk of overfitting.
  • Pooling may take two exemplary forms of implementation: mean pooling and max pooling. Pooling may be considered as a special convolutional process. Convolution and pooling greatly simplify complexity of the neural network and reduce parameters of the neural network.
  • The training images having human heads therein may be pre-marked (or labeled) with human head positions. For example, human head positions in the training images may be manually marked or labeled, or may be marked or labeled using other automatic means. The training images having the marked human head positions and the image to be detected may be images captured in a similar scene, setting or background, thereby further improving the accuracy of human head detection. The training images having marked human head positions may be of the same size as or a different size from the image to be detected.
  • In an embodiment, when the convolutional neural network is trained, a confidence level may be assigned to the human head position marked in the training image. The training image is segmented into one or more sub-images according to the same segmentation manner as that of the image to be detected. The segmented sub-images are separately input to the convolutional neural network, and the convolutional neural network outputs human head positions and confidence levels. A difference between the output head positions and the marked head position is calculated, and a difference between the corresponding confidence levels is calculated. According to the two differences, parameters of the convolutional neural network are adjusted. The training is continued until a termination condition is reached. The termination condition may be that each difference is less than a preset difference threshold, or the number of iterations reaches a preset number of times.
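  • The training procedure above can be summarized by a minimal PyTorch-style sketch, assuming a network that returns a position quadruple and a sigmoid confidence per output location, and assuming equal weighting of the position difference and the confidence difference. The loss functions, the function name and the optimizer handling are illustrative assumptions rather than details from this disclosure.

        import torch.nn.functional as F

        def training_step(net, optimizer, sub_images, marked_positions, marked_conf):
            # sub_images: a batch of sub-images segmented from a training image
            # marked_positions / marked_conf: marked head positions and the
            # confidence assigned to them (e.g. 1.0)
            pred_positions, pred_conf = net(sub_images)
            position_diff = F.smooth_l1_loss(pred_positions, marked_positions)
            conf_diff = F.binary_cross_entropy(pred_conf, marked_conf)
            loss = position_diff + conf_diff  # adjust parameters by both differences
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            return float(loss)  # compare with a preset threshold to decide termination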
  • The preprocessing layer is used above as a general term of other layers in the convolutional neural network except for the regression layer and a convolutional layer before the regression layer. The preprocessing layer may include at least one convolutional layer and at least one pooling layer. The preprocessing layer may include parallel convolutional layers, and data output by the parallel convolutional layers may be spliced and input to a next layer. The last layer in the preprocessing layer may be a convolutional layer or a pooling layer. The preprocessing layer may include multiple pairs of convolutional layer and pooling layer connected in tandem. The preprocessing layer may include additional rectifying layers.
  • S306: Map, through a convolutional layer after the preprocessing layer in the convolutional neural network, a first feature corresponding to each sub-image to a second feature corresponding to each sub-image.
  • A conventional convolutional neural network is generally used for classification, and the preprocessing layer in the convolutional neural network for classification is followed by a fully connected layer. The fully connected layer may map the first feature output by the preprocessing layer to probability data corresponding to each preset type (or class). Therefore, a type to which an input image belongs may be determined by the regression layer. In this embodiment, the convolutional neural network is used for human head detection rather than classification. As such, the convolutional layer is configured to replace the fully connected layer, and to output the second feature for describing the sub-image features. The number of the second features corresponding to each sub-image may be plural.
  • S308: Map, through a regression layer in the convolutional neural network, the second feature corresponding to each sub-image to a human head position corresponding to each sub-image and a confidence level corresponding to the human head position.
  • The human head position may be represented by a position of a rectangular box bounding a human head in the image. The position of the rectangular box may be represented by a quadruple. The quadruple may include a horizontal coordinate and a longitudinal coordinate of one vertex of the rectangular box and a width and a height of the rectangular box. Alternatively, the quadruple may include a horizontal coordinate and a longitudinal coordinate of each of two diagonal vertexes of the rectangular box. The confidence levels output by the regression layer are in a one-to-one correspondence with the human head positions output by the regression layer, each indicating a probability that the corresponding rectangular box does correspond to a human head at the corresponding position in the image. The regression layer may use a support vector machine (SVM).
  • In an embodiment, step S308 includes: mapping, through the convolutional layer in the regression layer in the convolutional neural network, the second feature corresponding to each sub-image to the human head position corresponding to each sub-image and the confidence level corresponding to the human head position. Specifically, the electronic device may directly map the second feature corresponding to each sub-image to the human head position corresponding to each sub-image and the confidence level corresponding to the human head position through the same convolutional layer in the regression layer in the convolutional neural network.
  • In an embodiment, step S308 includes: mapping, through a first convolutional layer in the regression layer in the convolutional neural network, the second feature corresponding to each sub-image to the human head position corresponding to each sub-image; and mapping, through a second convolutional layer in the regression layer in the convolutional neural network, the second feature corresponding to each sub-image to the confidence level corresponding to the output human head position.
  • For example, referring to FIG. 4, the sub-image yields 128 feature matrices (feature maps), each with a size of M*N, through the preprocessing layer in the convolutional neural network. 128 is a preset value for the number of features, and can be set as needed. M and N are determined by parameters of the preprocessing layer. The 128 feature matrices with the size M*N are input to the convolutional layer after the preprocessing layer. By performing convolution processing using a parameter matrix with a size of 128*1024 in that convolutional layer, M*N feature vectors with a length of 1024 are output. The M*N feature vectors with the length 1024 are input to the first convolutional layer in the regression layer, are convolved with a parameter matrix with a size of 1024*4, and M*N quadruples representing the human head positions are output. The M*N feature vectors with the length 1024 are also input to the second convolutional layer in the regression layer, are convolved with a parameter vector with a size of 1024*1, and M*N values indicating the confidence levels of the human head positions are output. The correspondence between a human head position and its confidence level is embodied in the order of the output M*N quadruples and confidence values.
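  • The example of FIG. 4 can be written down as a small PyTorch-style module, assuming that the per-location 128-to-1024 and 1024-to-4 / 1024-to-1 mappings are implemented as 1x1 convolutions and that a ReLU and a sigmoid are used as activations. The class name and the activation choices are illustrative assumptions, not taken from the disclosure.

        import torch
        import torch.nn as nn

        class HeadRegressionHead(nn.Module):
            # Maps the 128 feature maps from the preprocessing layer to M*N
            # position quadruples and M*N confidence values, mirroring FIG. 4.
            def __init__(self):
                super().__init__()
                # 128 -> 1024 mapping applied at every spatial location (second feature)
                self.second_feature = nn.Conv2d(128, 1024, kernel_size=1)
                # first convolutional layer of the regression layer: 1024 -> 4 (quadruple)
                self.position_head = nn.Conv2d(1024, 4, kernel_size=1)
                # second convolutional layer of the regression layer: 1024 -> 1 (confidence)
                self.confidence_head = nn.Conv2d(1024, 1, kernel_size=1)

            def forward(self, first_feature):  # first_feature: (B, 128, M, N)
                x = torch.relu(self.second_feature(first_feature))
                positions = self.position_head(x)                      # (B, 4, M, N)
                confidences = torch.sigmoid(self.confidence_head(x))   # (B, 1, M, N)
                return positions, confidences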
  • S310: Filter, according to the corresponding confidence level, the human head position corresponding to each sub-image, and acquire a human head position detected in the image to be detected.
  • Specifically, the electronic device may compare the confidence level of each human head position output by the convolutional neural network with a confidence level threshold, and filter out human head positions whose confidence levels are less than the confidence level threshold. The electronic device may further filter out, from the human head positions retained by the confidence level threshold, those whose areas are smaller than a preset area. The electronic device may cluster the filtered human head positions to combine a plurality of human head positions of the same type into one combined human head position in the image to be detected, or select one of the plurality of human head positions clustered to the same type as the human head position in the image to be detected.
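  • A minimal sketch of this filtering step is given below, assuming (x, y, width, height) quadruples; the function name, threshold and minimum-area values are illustrative, and the clustering of the remaining positions described above is omitted for brevity.

        def filter_candidates(boxes, confidences, conf_threshold=0.9, min_area=100.0):
            # boxes: (x, y, w, h) quadruples output by the convolutional neural network
            kept = []
            for (x, y, w, h), c in zip(boxes, confidences):
                # keep positions whose confidence and box area are both large enough
                if c >= conf_threshold and w * h >= min_area:
                    kept.append((x, y, w, h))
            return kept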
  • According to the foregoing human head detection method, the convolutional neural network is trained in advance based on the training images having the marked human head positions, and the convolutional neural network can automatically learn human head features. The trained convolutional neural network can automatically extract appropriate features from the sub-images to output candidate human head positions and corresponding confidence levels, and then filter, according to the confidence levels, to acquire the human head position in the image to be detected. The human head shape is learned rather than pre-assumed. As such, a missed detection caused by presuming the shape of the human head can be avoided, and the accuracy of the human head detection is improved. Moreover, in the convolutional neural network, the first features of the sub-images are output by the preprocessing layer including the convolutional layer and the pooling layer, and the second features are output by the convolutional layer after the preprocessing layer and before the regression layer to accurately describe human head features in the sub-images. Therefore, the second features are directly mapped to the human head positions and confidence levels by the regression layer, which is a new application of a convolutional neural network with this new structure. Compared with traditional circle detection, the accuracy of the human head detection is greatly improved.
  • In an embodiment, before step S302, the human head detection method further includes a step of converting and training the convolutional neural network for classification to a convolutional neural network for human head detection. Referring to FIG. 5, the step of converting and training the convolutional neural network for classification to a convolutional neural network for human head detection includes the following steps:
  • S502: Convert a fully connected layer after the preprocessing layer and before the regression layer included in the convolutional neural network for classification to a convolutional layer.
  • A conventional convolutional neural network for classification is a trained convolutional neural network which can classify images input to the convolutional neural network, such as GoogLeNet, VGGNet, or AlexNet. The convolutional neural network for classification includes the preprocessing layer, the fully connected layer, and the regression layer. The fully connected layer is configured to output second features corresponding to each preset type (or class) of the conventional classification application.
  • The sparse connection and weight sharing of the fully connected layer and the convolutional layer are different. Each neuron of the fully connected layer is connected to all neurons of a preceding layer. Both the convolutional layer and the fully connected layer acquire input of a next layer by multiplying output of the preceding layer by a parameter matrix. As such, the conventional fully connected layer can be converted to the convolutional layer by changing an arrangement manner of parameters of the fully connected layer.
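  • The conversion described above can be sketched in PyTorch by rearranging the fully connected layer's parameter matrix into a convolution kernel that spans the feature map the fully connected layer used to flatten. The helper name and the assumptions that the input feature map has a fixed known spatial size and that the fully connected layer has a bias term are illustrative.

        import torch.nn as nn

        def fc_to_conv(fc, in_channels, feature_h, feature_w):
            # Rearrange the parameters of a fully connected layer into an
            # equivalent convolutional layer whose kernel covers the whole
            # feature map the fully connected layer used to flatten.
            assert fc.in_features == in_channels * feature_h * feature_w
            conv = nn.Conv2d(in_channels, fc.out_features,
                             kernel_size=(feature_h, feature_w))
            conv.weight.data = fc.weight.data.view(
                fc.out_features, in_channels, feature_h, feature_w)
            conv.bias.data = fc.bias.data.clone()  # assumes fc has a bias term
            return conv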
  • S504: Replace the regression layer in the convolutional neural network for classification with a regression layer configured to map the second feature output by the converted convolutional layer to the human head position and the corresponding confidence level.
  • In the conventional convolutional neural network for classification, the regression layer is configured to map the second features of each preset type output by the fully connected layer to a probability corresponding to each preset type, and determine, according to the mapped probability, a preset type to which the image belongs. For example, a preset type corresponding to a maximum probability is selected as the preset type to which the input image belongs.
  • In the convolutional neural network for human head detection of this disclosure, the regression layer is configured to map a preset number of second features output by the converted convolutional layer to the human head positions and the confidence levels corresponding to the human head positions. The regression layer may use a convolutional layer. The convolutional layer directly maps the second features to the human head positions and the confidence levels corresponding to the human head positions. The regression layer may also use two convolutional layers in parallel. One convolutional layer is configured to map the second features to the human head positions, and the other convolutional layer is configured to map the second features to the confidence levels corresponding to the mapped human head positions.
  • S506: Train the convolutional neural network including the preprocessing layer, the converted convolutional layer, and the replaced regression layer by using the training images having the marked human head positions.
  • The convolutional neural network including the preprocessing layer, the converted convolutional layer and the replaced regression layer is reconstructed and modified from the conventional convolutional neural network for classification applications. In one implementation, the parameters of the preprocessing layer may be pre-trained. For the reconstructed convolutional neural network, mainly the parameters in the converted convolutional layer and the replaced regression layer then need to be trained. The training may also be a joint process; for example, the entire network may be trained, with the preprocessing layer parameters initialized to their pre-trained values and retrained together with the rest of the network.
  • Specifically, when the reconstructed convolutional neural network is trained, the confidence level may be pre-assigned to the marked human head positions of the training image. The training image is segmented into one or more sub-images according to the same segmenting manner as that of the image to be detected. The segmented sub-images are respectively input to the convolutional neural network, and the human head positions and the confidence levels are output by the preprocessing layer, the convolutional layer after the preprocessing layer, and the regression layer of the convolutional neural network. The difference between the output human head positions and the marked human head position is calculated, and the difference between the corresponding confidence levels is calculated, and the parameters in the preprocessing layer, the convolutional layer after the preprocessing layer, and the regression layer in the convolutional neural network are adjusted according to the two differences. The training is continued until a termination condition is reached. The termination condition may be that the difference is less than a preset difference, or the number of iterations reaches a preset number of times.
  • In this embodiment, the training is performed after a conventional convolutional neural network for classification is reconstructed into the convolutional neural network for human head detection. Because the reconstruction does not require a complete redesign of the neural network, the training duration can be reduced and the efficiency of human head detection is improved.
  • As shown in FIG. 6, in an embodiment, step S310 specifically includes the following steps:
  • S602: Screen, from the human head positions corresponding to the sub-images, to acquire a human head position corresponding to a confidence level greater than or equal to a confidence level threshold.
  • Specifically, the electronic device may form the human head positions respectively corresponding to the sub-images segmented from the image to be detected into a human head position set, traverse the human head position set, and compare the confidence levels of the traversed human head positions with the confidence level threshold. The human head positions having confidence levels lower than the confidence level threshold may be removed from the human head position set. The remaining human head positions in the human head position set after the traversing are the acquired human head positions of which the corresponding confidence levels are greater than or equal to the confidence level threshold. The confidence level threshold may be set as needed, for example, may be valued from 0.5 to 0.99.
  • S604: Select, from the human head positions corresponding to the sub-images, human head positions that intersect, in the image to be detected, with the human head positions acquired in S602.
  • The intersection of the human head positions means that enclosed areas indicated by respective human head positions have an intersection in the image to be detected. When the human head position is represented by a position of a rectangular box including the human head image, the intersection of the human head positions is the intersection of the corresponding rectangular boxes. Specifically, the electronic device may select a human head position intersecting with the acquired human head position in the image to be detected from the human head position set formed by the human head positions respectively corresponding to all the sub-images segmented from the image to be detected. The electronic device may also seek for the intersecting human head positions from only the acquired human head positions.
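  • Steps S602 and S604 can be illustrated with the following sketch, assuming (x, y, width, height) quadruples and a set-based rather than traversal-based implementation; the function names and the example threshold are illustrative assumptions.

        def rects_intersect(a, b):
            # a, b: (x, y, w, h) quadruples of two candidate head positions
            ax, ay, aw, ah = a
            bx, by, bw, bh = b
            return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

        def screen_and_select(positions, confidences, threshold=0.9):
            # S602: keep positions whose confidence reaches the threshold
            acquired = [p for p, c in zip(positions, confidences) if c >= threshold]
            # S604: from the remaining positions, select those intersecting an acquired one
            remaining = [p for p, c in zip(positions, confidences) if c < threshold]
            selected = [p for p in remaining
                        if any(rects_intersect(p, a) for a in acquired)]
            return acquired, selected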
  • S606: Determine, according to the acquired human head position and the identified human head position, the human head position detected in the image to be detected.
  • Specifically, the electronic device may classify the acquired human head positions and the selected human head positions. Each type includes at least one of the acquired human head positions, and also includes human head positions intersecting with the at least one human head position. The electronic device may combine the human head positions of each type to one human head position as a detected head position, or select one human head position from the human head positions of each type as the detected human head position.
  • In this embodiment, the accuracy of human head detection can be further improved by using the confidence levels and the position intersection as the basis for determining the human head position in the image to be detected.
  • As shown in FIG. 7, in an embodiment, step S606 specifically includes the following steps:
  • S702: Use the acquired human head positions (from step S602) and the selected human head positions (from step S604) as nodes in a bipartite graph, as a first group and a second group, respectively.
  • A bipartite graph is a graph, in the sense of graph theory, whose nodes can be divided into two groups such that every edge connects a node in one group to a node in the other group.
  • S704: Assign default and positive weights to the edges between the nodes in the bipartite graph.
  • There is an edge between each acquired human head position and the correspondingly selected intersecting human head position. The default and positive weight is a positive value, such as 1000.
  • S706: Reduce the correspondingly assigned weights for the corresponding edges that are associated with nodes representing intersecting head positions.
  • Specifically, when the human head positions indicated by the nodes associated with the edges intersect, the electronic device may subtract a positive value less than the default and positive weight from the correspondingly assigned weight, and then divide the subtracted value by the default and positive weight to acquire an updated weight. If the default and positive weight is 1000, and the positive value less than the default and positive weight is 100, then the updated weight is (1000−100)/1000=0.9.
  • S708: Solve a maximum weight edge combination of the bipartite graph, and acquire the human head position detected in the image to be detected.
  • An edge combination in the bipartite graph is a set of edges that have no common nodes. If the sum of the weights of the edges in one edge combination is the largest among all edge combinations of the bipartite graph, this edge combination is referred to as the maximum weight edge combination. The electronic device may traverse all edge combinations in the bipartite graph to find the maximum weight edge combination. The electronic device may also use the Kuhn-Munkres algorithm to solve the maximum weight edge combination of the bipartite graph. After the maximum weight edge combination is solved, the human head positions associated with the edges in the maximum weight edge combination can be used as the human head positions detected in the image to be detected.
  • In this embodiment, since intersecting human head positions may correspond to the same human head, the human head positions output by the convolutional neural network are mostly gathered near the actual human head position in the image to be detected. Therefore, the acquired human head positions (step S602, for example) and the selected human head positions (step S604, for example) are used as the nodes in the bipartite graph to construct the bipartite graph, and the weights of the edges corresponding to intersecting human head positions are reduced. By solving the maximum weight edge combination, the human head positions detected in the image to be detected are acquired, and the human head detection can be performed more accurately.
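  • A minimal sketch of steps S702 to S708 is shown below, assuming the Kuhn-Munkres solver provided by SciPy (linear_sum_assignment) and the example weights above (a default of 1000, reduced to 0.9 for intersecting positions); the function names and matrix layout are illustrative assumptions.

        import numpy as np
        from scipy.optimize import linear_sum_assignment

        def rects_intersect(a, b):
            ax, ay, aw, ah = a
            bx, by, bw, bh = b
            return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

        def match_head_positions(acquired, selected, default_weight=1000.0, penalty=100.0):
            # One node group per list; every edge starts at the default positive
            # weight, and the weight is reduced (here to 0.9) when the two head
            # positions intersect, as in the example above.
            weights = np.full((len(acquired), len(selected)), default_weight)
            for i, a in enumerate(acquired):
                for j, s in enumerate(selected):
                    if rects_intersect(a, s):
                        weights[i, j] = (default_weight - penalty) / default_weight
            # the Hungarian (Kuhn-Munkres) solver minimizes cost, so negate the weights
            rows, cols = linear_sum_assignment(-weights)
            return [(acquired[i], selected[j]) for i, j in zip(rows, cols)]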
  • In an embodiment, the image to be detected may be a video frame in a video, and the human head detection method further includes a step of performing human head tracking and performing people counting frame by frame. Referring to FIG. 8, the step of performing human head tracking and performing people counting frame by frame specifically includes the following steps:
  • S802: Perform human head tracking video frame by video frame according to the human head position detected in the image to be detected.
  • Specifically, after detecting the human head position in one video frame, the electronic device performs human head tracking video frame by video frame by using the detected human head position as a starting point. The electronic device may specifically use a mean shift tracking algorithm, an optical flow tracking algorithm, or a tracking-learning-detection (TLD) algorithm.
  • S804: Determine a moving direction and a positional relationship of the tracked human head position relative to a designated area.
  • The designated area refers to the area designated in the video frame. The moving direction of the tracked human head position relative to the designated area refers to that the human head position is, for example, moving toward or away from the designated area. The positional relationship of the tracked human head position relative to the designated area refers to that the human head position is inside or outside the designated area.
  • In an embodiment, when the tracked human head position crosses a line representing a boundary of the designated area in a direction toward the designated area, it is determined that the tracked human head position enters the designated area. When the tracked human head position crosses the line representing the boundary of the designated area in a direction away from the designated area, it is determined that the tracked human head position leaves the designated area.
  • In an embodiment, when the tracked human head position sequentially crosses a first line and a second line parallel with the first line, it is determined that the tracked human head position enters the designated area. When the tracked human head position sequentially crosses the second line and the first line, it is determined that the tracked human head position leaves the designated area.
  • The parallel first line and second line may be straight lines or curved lines. The designated area may be one of the two areas formed by segmenting the image to be detected with the second line, not including the first line. In this embodiment, the moving direction and the positional relationship of the tracked human head position relative to the designated area are determined by the two lines, thereby preventing a judgment error caused by movement of the human head position in the vicinity of the boundary of the designated area and ensuring the correctness of people counting.
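  • A minimal sketch of the two-line judgment is given below, assuming horizontal lines, a head position tracked as a y coordinate per video frame, and at most one line crossed between consecutive frames; the function name and the frame-pair bookkeeping are illustrative assumptions.

        def count_entries_and_exits(track_y, line1_y, line2_y):
            # track_y: y coordinate of one tracked head position, frame by frame;
            # assumes line1_y < line2_y, i.e. line 1 is reached first when entering
            crossings = []
            for prev_y, curr_y in zip(track_y, track_y[1:]):
                for name, line_y in (("line1", line1_y), ("line2", line2_y)):
                    if prev_y < line_y <= curr_y or prev_y > line_y >= curr_y:
                        crossings.append(name)
            entered = left = 0
            for first, second in zip(crossings, crossings[1:]):
                if (first, second) == ("line1", "line2"):
                    entered += 1      # sequentially crossed line 1 then line 2
                elif (first, second) == ("line2", "line1"):
                    left += 1         # sequentially crossed line 2 then line 1
            return entered, left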
  • S806: Perform people counting according to the determined moving direction and positional relationship.
  • The people counting may specifically be counting a combination of one or more of the number of accumulated people entering the designated area, the number of accumulated people leaving the designated area, and the dynamic number of people in the designated area. Specifically, the electronic device may add 1 to the accumulated number of people entering the designated area, and/or add 1 to the dynamic number of people in the designated area, when one tracked human head position enters the designated area. The electronic device may add 1 to the accumulated number of people leaving the designated area, and/or subtract 1 from the dynamic number of people in the designated area, when one tracked human head position leaves the designated area.
  • In this embodiment, the human head detection may be applied to security applications. The people counting is performed according to the moving direction and the positional relationship of the tracked human head position relative to the designated area. Based on accurate human head detection, the accuracy of people counting can be improved.
  • In an embodiment, the human head detection method further includes a step of detecting the human head position and continuing tracking near the human head position tracked in a previous video frame when the tracking of the human head position is interrupted. Referring to FIG. 9, the step specifically includes the following steps:
  • S902: Track and record the human head position video frame by video frame.
  • Specifically, the electronic device tracks the detected human head position with the detected human head position in the image to be detected as a starting point, and records the tracked human head position.
  • S904: Acquire a human head position tracked in a previous recorded video frame if the tracking of the human head position in a current video frame is interrupted.
  • Specifically, when a person moves quickly or lighting changes, the tracking of the human head position may be interrupted; in this case, the human head position tracked in the previous video frame and recorded during the frame-by-frame tracking is acquired.
  • S906: Detect human head positions in a local area covering the acquired human head position (in step 904) in the current video frame.
  • The local area covering the acquired human head position is smaller than a size of one video frame, and larger than a size of the area occupied by the human head position tracked in the previous video frame. A shape of the local area may be similar to a shape of the area occupied by the human head position tracked in the previous video frame. A center of the local area may overlap with a center of the area occupied by the human head position tracked in the previous video frame.
  • Specifically, the electronic device may detect the human head positions in the current video frame to find the human head positions belonging to the local area. The electronic device may also detect the human head positions only in the local area. The electronic device may specifically use steps S302 to S310 to detect the human head positions in the local area in the current video frame. The detected human head positions may be partially or entirely located in the local area. The electronic device may use the human head positions of which the centers are within the local area as the human head positions in the detected local area, and the human head positions of which the centers are outside the local area do not belong to the human head positions in the local area.
  • For example, when the human head position is represented by a position of a rectangular box including the human head image, if the width of the rectangular box tracked in the previous video frame is W and the height is H, and a and b are set to coefficients greater than 1, then the local area may be the rectangular area having a width of a*W and a height of b*H and the same center as the rectangular box. If the center coordinates of the rectangular box tracked in the previous video frame are (X1, Y1) and the center coordinates of another rectangular box indicating a human head position are (X2, Y2), then when |X1−X2|<W/2 and |Y1−Y2|<H/2, the rectangular box of which the center coordinates are (X2, Y2) is determined to be in the local area of the rectangular box of which the center coordinates are (X1, Y1).
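  • The center-based membership test in the example above can be written as a short helper, assuming box centers and sizes are available; the function name and argument layout are illustrative assumptions.

        def in_local_area(prev_center, prev_size, cand_center):
            # prev_center, prev_size: center (x, y) and (width, height) of the box
            # tracked in the previous video frame; cand_center: center of a head
            # position detected in the current frame, following the example above
            (x1, y1), (w, h) = prev_center, prev_size
            x2, y2 = cand_center
            return abs(x1 - x2) < w / 2 and abs(y1 - y2) < h / 2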
  • S908: Continue to perform step S902 from the human head position detected in the local area.
  • In this embodiment, when the tracking of the human head positions is interrupted, the human head positions can be detected from the vicinity of the human head positions detected in the previous frame, and the interrupted human head tracking can be recovered from the interruption and continued. The human head detection and the human head tracking are combined to ensure the continuity of the tracking. Further, the accuracy of people counting is ensured.
  • The specific principle of the foregoing human head detection method is described below with a specific application scenario. A large number of top view images at an elevator entrance scene are acquired in advance, and the human head positions in these top view images are marked or labeled. For example, a quadruple is used to indicate the position of the human head image in a rectangular box 1001 in FIG. 10. A convolutional neural network for classification is selected, the fully connected layer after the preprocessing layer and before the regression layer is converted to a convolutional layer, and the regression layer therein is replaced with the regression layer configured to map the second feature output by the converted convolutional layer to the human head position and the corresponding confidence level, thereby retraining the convolutional neural network by using the marked top view image.
  • Referring to FIG. 11, in actual application, if the number of people entering and exiting a gate needs to be counted, a top view camera is disposed above the gate, and the videos are captured by the top view camera and transmitted to an electronic device connected to the top view camera. The electronic device uses an image area sandwiched by a line 1101 and a line 1104 in one of the video frames as an image to be detected, and segments the image to be detected into one or more sub-images. Each sub-image is input to a convolutional neural network trained by training images having marked human head positions. The convolutional neural network outputs the human head positions corresponding to each sub-image and the confidence levels corresponding to the human head positions, thereby filtering, according to the corresponding confidence levels, the human head positions corresponding to each sub-image, and acquiring the human head positions detected in the image to be detected.
  • Further, the electronic device performs human head tracking video frame by video frame according to the human head position detected in the image to be detected, and it is determined that a tracked human head position 1105 enters a designated area when the tracked human head position 1105 sequentially crosses a first line 1102 and a second line 1103 parallel with the first line 1102. When a tracked human head position 1106 sequentially crosses the second line 1103 and the first line 1102, it is determined that the tracked human head position 1106 leaves the designated area. The designated area in FIG. 11 may be specifically the area sandwiched by the second line 1103 and a line 1104.
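  • The two-line enter/leave rule of FIG. 11 can be illustrated with a short Python sketch. It assumes both lines are horizontal, reduces each tracked head position to the y coordinate of its box center, and uses hypothetical class and attribute names; it is a sketch of the rule described above rather than the actual counting logic.

```python
class TwoLineCounter:
    """People counter based on the two-line crossing rule described above.

    Assumes horizontal lines at y = first_y (outer) and y = second_y (inner,
    with second_y > first_y), and that each tracked head is reduced to the
    y coordinate of its box center.  Class and method names are illustrative
    assumptions, not taken from this application.
    """

    def __init__(self, first_y, second_y):
        self.first_y = first_y
        self.second_y = second_y
        self.entered = 0   # heads that crossed first_y and then second_y
        self.left = 0      # heads that crossed second_y and then first_y
        self._crossed_first = {}   # track_id -> True once first_y was crossed downward
        self._crossed_second = {}  # track_id -> True once second_y was crossed upward

    def update(self, track_id, prev_y, curr_y):
        # Downward motion: first line then second line -> enters the area.
        if prev_y < self.first_y <= curr_y:
            self._crossed_first[track_id] = True
        if prev_y < self.second_y <= curr_y and self._crossed_first.pop(track_id, False):
            self.entered += 1
        # Upward motion: second line then first line -> leaves the area.
        if prev_y > self.second_y >= curr_y:
            self._crossed_second[track_id] = True
        if prev_y > self.first_y >= curr_y and self._crossed_second.pop(track_id, False):
            self.left += 1


# Example: one head moves downward across both lines over two frame steps.
counter = TwoLineCounter(first_y=200.0, second_y=260.0)
counter.update(track_id=7, prev_y=180.0, curr_y=220.0)  # crosses the first line
counter.update(track_id=7, prev_y=220.0, curr_y=280.0)  # crosses the second line
print(counter.entered, counter.left)  # 1 0
```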
  • In an embodiment, an electronic device is further provided, and an internal structure of the electronic device may be shown in FIG. 2. The electronic device includes a human head detection apparatus. The human head detection apparatus includes various modules, and the modules may be all or partially implemented by software, hardware or a combination thereof.
  • FIG. 12 is a structural block diagram of a human head detection apparatus 1200 according to an embodiment. Referring to FIG. 12, the human head detection apparatus 1200 includes a segmenting module 1210, a convolutional neural network module 1220, and a human head detection result determining module 1230.
  • The segmenting module 1210 is configured to segment an image to be detected into one or more sub-images.
  • The convolutional neural network module 1220 is configured to input each sub-image to a convolutional neural network trained according to training images having marked human head positions, and output, by a preprocessing layer including at least one convolutional layer and at least one pooling layer in the convolutional neural network, a first feature corresponding to each sub-image; map, through the convolutional layer after the preprocessing layer in the convolutional neural network, the first feature corresponding to each sub-image to a second feature corresponding to each sub-image; and map, through a regression layer of the convolutional neural network, the second feature corresponding to each sub-image to a human head position corresponding to each sub-image and a corresponding confidence level of the human head position.
  • The human head detection result determining module 1230 is configured to filter, according to the corresponding confidence level, the human head position corresponding to each sub-image, to acquire a human head position detected in the image to be detected.
  • According to the human head detection apparatus 1200, the convolutional neural network is trained in advance based on the training images having the marked human head positions, and the convolutional neural network can automatically learn human head features. The trained convolutional neural network can automatically extract appropriate features from the sub-images to output candidate human head positions and corresponding confidence levels, and the candidate positions are then filtered according to the confidence levels to acquire the human head positions in the image to be detected. The human head shape is not required to be assumed in advance, a missed detection caused by presetting the human head shape can be avoided, and the accuracy of the human head detection is improved. Moreover, in the convolutional neural network, the first features of the sub-images are output by the preprocessing layer including the convolutional layer and the pooling layer, and the second features are output by the convolutional layer after the preprocessing layer and before the regression layer to accurately describe the human head features in the sub-images. Therefore, the second features are directly mapped to the human head positions and confidence levels by the regression layer, which is a new application of a convolutional neural network with a new structure. Compared with traditional circle detection, the accuracy of the human head detection is greatly improved.
  • In an embodiment, the segmenting module 1210 is further configured to segment the image to be detected into one or more sub-images of a fixed size, and adjacent sub-images in the segmented sub-images have an overlapping part. In this embodiment, there is an overlapping part between the adjacent segmented sub-images, thereby ensuring that the adjacent sub-images have stronger correlation, and improving accuracy of detecting a human head position from the image to be detected.
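  • A minimal Python sketch of this fixed-size, overlapping segmentation is given below. The tile size and stride values are illustrative assumptions (any stride smaller than the tile size produces the described overlap), and border tiles may additionally need padding or shifting in practice.

```python
import numpy as np


def segment_into_sub_images(image, tile=224, stride=160):
    """Split an H x W x C image into fixed-size, overlapping tiles.

    A stride smaller than the tile size makes adjacent sub-images overlap,
    as described above; the tile and stride values here are illustrative.
    Returns a list of (x, y, sub_image) tuples, where (x, y) is the tile's
    top-left corner in the original image.  Border handling (padding or
    shifting the last tile) is omitted for brevity.
    """
    height, width = image.shape[:2]
    sub_images = []
    for y in range(0, max(height - tile, 0) + 1, stride):
        for x in range(0, max(width - tile, 0) + 1, stride):
            sub_images.append((x, y, image[y:y + tile, x:x + tile]))
    return sub_images


# Example: a dummy 480x640 frame yields six overlapping 224x224 sub-images.
frame = np.zeros((480, 640, 3), dtype=np.uint8)
tiles = segment_into_sub_images(frame)
print(len(tiles), tiles[0][2].shape)  # 6 (224, 224, 3)
```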
  • As shown in FIG. 13, in an embodiment, the human head detection apparatus 1200 further includes a convolutional neural network adjusting module 1240 and a training module 1250.
  • The convolutional neural network adjusting module 1240 is configured to convert a fully connected layer after the preprocessing layer and before the regression layer included in the convolutional neural network for classification to a convolutional layer; and replace a regression layer in the convolutional neural network for classification with a regression layer configured to map the second feature output by the converted convolutional layer to the human head position and the corresponding confidence level.
  • The training module 1250 is configured to train the convolutional neural network including the preprocessing layer, the converted convolutional layer and the replaced regression layer by using the training image having the marked human head position.
  • In this embodiment, the convolutional neural network for human head detection is acquired by reconstructing the convolutional neural network for classification and then retraining it. Because a convolutional neural network does not need to be constructed and trained from scratch, the training duration can be reduced and the efficiency of human head detection is improved.
  • In an embodiment, the convolutional neural network module 1220 is further configured to map, through a first convolutional layer in the regression layer in the convolutional neural network, the second feature corresponding to each sub-image to a human head position corresponding to each sub-image; and map, through a second convolutional layer in the regression layer in the convolutional neural network, the second feature corresponding to each sub-image to a confidence level corresponding to the output human head position.
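  • To make the layer arrangement handled by modules 1240, 1220 and 1250 concrete, the following PyTorch sketch builds a small fully convolutional network with a preprocessing stage (convolution plus pooling), a 1x1 convolutional layer standing in for a converted fully connected layer, and a regression stage with two convolutional heads for positions and confidence levels. The channel counts, kernel sizes and output encoding (four box values per location) are assumptions for illustration, not the specification of this application.

```python
import torch
import torch.nn as nn


class HeadDetectionNet(nn.Module):
    """Illustrative fully convolutional head detector with assumed dimensions."""

    def __init__(self):
        super().__init__()
        # Preprocessing layer: at least one convolutional layer and one pooling layer.
        self.preprocess = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Convolutional layer in place of a classifier's fully connected layer:
        # a 1x1 convolution applies the same linear mapping at every spatial location.
        self.converted_fc = nn.Conv2d(64, 128, kernel_size=1)
        # Regression layer with two convolutional heads: one maps the second feature
        # to box coordinates, the other maps it to a confidence level.
        self.position_head = nn.Conv2d(128, 4, kernel_size=1)    # e.g. (x, y, w, h)
        self.confidence_head = nn.Conv2d(128, 1, kernel_size=1)

    def forward(self, sub_image):
        first_feature = self.preprocess(sub_image)
        second_feature = torch.relu(self.converted_fc(first_feature))
        positions = self.position_head(second_feature)
        confidences = torch.sigmoid(self.confidence_head(second_feature))
        return positions, confidences


# Example: one 224x224 RGB sub-image yields per-location boxes and confidence levels.
boxes, scores = HeadDetectionNet()(torch.zeros(1, 3, 224, 224))
print(boxes.shape, scores.shape)  # torch.Size([1, 4, 56, 56]) torch.Size([1, 1, 56, 56])
```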
  • As shown in FIG. 14, in an embodiment, the human head detection result determining module 1230 includes a filtering module 1231 and a head position determining module 1232.
  • The filtering module 1231 is configured to screen the human head positions corresponding to the sub-images to acquire human head positions whose confidence levels are greater than or equal to a confidence level threshold, and to select, from the human head positions corresponding to the sub-images, human head positions intersecting with the acquired human head positions in the image to be detected.
  • The human head position determining module 1232 is configured to determine, according to the acquired human head position and the selected human head position, the human head position detected in the image to be detected.
  • In this embodiment, the accuracy of the human head detection can be further improved by using both the confidence levels and whether the human head positions intersect as the basis for determining the human head positions in the image to be detected.
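  • A minimal Python sketch of this screening step is given below, assuming each candidate is a ((x, y, w, h), confidence) pair; the helper names, the overlap test and the threshold value are illustrative assumptions.

```python
def boxes_intersect(a, b):
    """Axis-aligned overlap test for (x, y, w, h) boxes (illustrative helper)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def screen_candidates(candidates, threshold=0.5):
    """Split candidates into screened (high-confidence) positions and lower-confidence
    positions that intersect one of them, following the filtering module above.

    `candidates` is a list of ((x, y, w, h), confidence); the threshold is illustrative.
    """
    screened = [c for c in candidates if c[1] >= threshold]
    overlapped = [
        c for c in candidates
        if c[1] < threshold and any(boxes_intersect(c[0], s[0]) for s in screened)
    ]
    return screened, overlapped


# Example: one confident detection plus a weaker, overlapping one.
cands = [((10, 10, 30, 30), 0.9), ((12, 14, 28, 30), 0.4), ((200, 5, 25, 25), 0.3)]
print(screen_candidates(cands))  # keeps the first box; selects the second as overlapped
```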
  • In an embodiment, the human head position determining module 1232 is further configured to use the acquired human head position and the selected human head position as nodes in a bipartite graph; assign default and positive weights to edges between the nodes in the bipartite graph; reduce the corresponding assigned weights when the human head positions indicated by the nodes associated with the edges intersect; and solve a maximum weight edge combination of the bipartite graph, to acquire the head position detected in the image to be detected.
  • In this embodiment, since intersecting human head positions are likely to correspond to the same human head, the human head positions output by the convolutional neural network are mostly gathered near the actual human head positions in the image to be detected. Therefore, the acquired human head positions and the selected human head positions are used as nodes to construct the bipartite graph, and the weights of the edges corresponding to intersecting human head positions are relatively small. By solving the maximum weight edge combination, the human head positions detected in the image to be detected are acquired, and the human head detection can be performed more accurately.
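  • The bipartite-graph step can be sketched in Python with the networkx library, which provides a maximum weight matching solver. The graph construction below (a default positive weight on every edge, a reduced weight when the two positions intersect, and reading the result off the matching) is one illustrative interpretation of the module description, not the exact procedure of this application.

```python
import networkx as nx


def _intersect(a, b):
    """Axis-aligned overlap test for (x, y, w, h) boxes (illustrative helper)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def match_head_candidates(screened, overlapped, default_weight=10.0, penalty=9.0):
    """Build the bipartite graph described above and solve a maximum weight matching.

    `screened` and `overlapped` are lists of (x, y, w, h) boxes; the weight values
    and the use of networkx are illustrative assumptions.  How the final detections
    are read off the matched edges is left to the surrounding logic.
    """
    graph = nx.Graph()
    for i, s_box in enumerate(screened):
        for j, o_box in enumerate(overlapped):
            weight = default_weight
            if _intersect(s_box, o_box):
                weight -= penalty  # intersecting pairs receive a reduced weight
            graph.add_edge(("screened", i), ("overlapped", j), weight=weight)
    # Maximum weight edge combination over the bipartite graph.
    return nx.max_weight_matching(graph, weight="weight")
```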
  • As shown in FIG. 15, in an embodiment, the image to be detected is a video frame in a video. The human head detection apparatus 1200 further includes:
  • a tracking module 1260, configured to perform head tracking video frame by video frame according to the human head position detected in the image to be detected;
  • a counting condition detecting module 1270, configured to determine a moving direction and a positional relationship of the tracked human head position relative to the designated area; and
  • a people counting module 1280, configured to perform people counting based on the determined moving direction and positional relationship.
  • In this embodiment, the human head detection is applied to the field of security. The people counting is performed according to the moving direction and the positional relationship of the tracked human head position relative to the designated area. Based on accurate human head detection, the accuracy of people counting can be ensured.
  • In an embodiment, the counting condition detecting module 1270 is further configured to: determine that the tracked human head position enters the designated area when the tracked human head position sequentially spans a first line and a second line parallel with the first line; and determine that the tracked human head position leaves the designated area when the tracked human head position sequentially spans the second line and the first line.
  • In this embodiment, the moving direction and the positional relationship of the tracked human head position relative to the designated area are determined by two lines, which prevents a judgment error caused by the human head position moving near a boundary of the designated area and ensures the correctness of people counting.
  • As shown in FIG. 16, in an embodiment, the human head detection apparatus 1200 further includes a human head position acquiring module 1290.
  • The tracking module 1260 is further configured to track and record the human head position video frame by video frame.
  • The human head position acquiring module 1290 is configured to acquire a human head position tracked in a previous recorded video frame if the tracking of the human head position in a current video frame is interrupted.
  • The convolutional neural network module 1220 is further configured to detect human head positions in a local area covering the acquired human head position in the current video frame.
  • The tracking module 1260 is further configured to continue to perform the step of tracking and recording the human head position video frame by video frame from the human head positions detected in the local area.
  • In this embodiment, when the tracking of the human head positions is interrupted, the human head positions can be detected from the vicinity of the human head positions detected in the previous frame, and the interrupted human head tracking can be continued. The human head detection and the human head tracking are combined to ensure the continuity of the tracking. Further, the accuracy of people counting is ensured.
  • It should be understood that the steps in the various embodiments of this application are not necessarily performed in the order indicated by the step numbers. Unless explicitly described in this specification, there is no strict sequence for execution of the steps. In addition, at least some steps in the embodiments may include a plurality of substeps or a plurality of stages. The substeps or the stages are not necessarily performed at the same moment, and instead may be performed at different moments. The substeps or the stages are not necessarily performed sequentially, and instead may be performed in turn or alternately with another step or with at least some of the substeps or stages of another step.
  • A person of ordinary skill in the art may understand that all or some of the processes of the methods in the foregoing embodiments may be implemented by a computer program instructing relevant hardware. The program may be stored in a non-volatile computer-readable storage medium. When the program runs, the processes of the foregoing methods in the embodiments are performed. The memory, storage, database or any other media in the embodiments provided in this application may include a non-volatile and/or volatile memory. The non-volatile memory may include: a read-only memory (ROM), a programmable ROM (PROM), an electrically programmable ROM (EPROM), an electrically erasable programmable ROM (EEPROM), or a flash memory. The volatile memory may include a random access memory (RAM) or an external cache memory. By way of illustration and not limitation, the RAM may be implemented in various forms, such as a static RAM (SRAM), a dynamic RAM (DRAM), a synchronous DRAM (SDRAM), a double data rate SDRAM (DDRSDRAM), an enhanced SDRAM (ESDRAM), a synchlink DRAM (SLDRAM), a rambus direct RAM (RDRAM), a direct rambus dynamic RAM (DRDRAM), and a rambus dynamic RAM (RDRAM).
  • Various technical features in the foregoing embodiments may be randomly combined. For ease of description, not all possible combinations of the various technical features in the foregoing embodiments are described. However, the combinations of the technical features should be considered as falling within the scope recorded in this specification as long as the combinations of the technical features are compatible with each other.
  • The foregoing embodiments only describe several implementations of this application, which are described specifically and in detail, and therefore cannot be construed as a limitation to the patent scope of the present disclosure. It should be noted that, a person of ordinary skill in the art may make various changes and improvements without departing from the ideas of this application, which shall all fall within the protection scope of this application. Therefore, the protection scope of the patent of this application shall be subject to the appended claims.

Claims (20)

What is claimed is:
1. A method for detecting a human head in an image performed by an electronic device comprising a processor, the method comprising:
segmenting, by the electronic device, the image into one or more sub-images;
inputting, by the electronic device, each sub-image to a convolutional neural network trained according to training images having marked human head positions, and outputting, by a preprocessing layer of the convolutional neural network comprising a first convolutional layer and a pooling layer, a first feature corresponding to each sub-image;
mapping, by the electronic device through a second convolutional layer after the preprocessing layer in the convolutional neural network, the first feature corresponding to each sub-image to a second feature corresponding to each sub-image;
mapping, by the electronic device through a regression layer in the convolutional neural network, the second feature corresponding to each sub-image to a human head position corresponding to each sub-image and a corresponding confidence level of the human head position; and
filtering, by the electronic device according to the corresponding confidence level, human head positions corresponding to the one or more sub-images, to acquire detected human head positions in the image.
2. The method according to claim 1, wherein segmenting, by the electronic device, the image into one or more sub-images comprises:
segmenting, by the electronic device, the image into one or more sub-images of a fixed size, wherein adjacent sub-images in the one or more sub-images partially overlap.
3. The method according to claim 1, wherein:
a fully connected layer in a conventional convolution neural network is converted to the second convolutional layer;
a conventional regression layer in a conventional convolutional neural network for image classification is replaced by the regression layer for mapping the second feature output by the second convolutional layer to the human head position and the corresponding confidence level; and
the method further comprises training, by the electronic device, the convolutional neural network comprising the preprocessing layer, the second convolutional layer, and the regression layer by using the training images having the marked human head positions.
4. The method according to claim 1, wherein mapping, by the electronic device through the regression layer in the convolutional neural network, the second feature corresponding to each sub-image to a human head position corresponding to each sub-image and a corresponding confidence level of the human head position comprises:
mapping, by the electronic device through a third convolutional layer in the regression layer of the convolutional neural network, the second feature corresponding to each sub-image to the human head position corresponding to each sub-image; and
mapping, by the electronic device through a fourth convolutional layer in the regression layer of the convolutional neural network, the second feature corresponding to each sub-image to the confidence level corresponding to the human head position.
5. The method according to claim 1, wherein filtering, by the electronic device according to the corresponding confidence level, the human head positions corresponding to the one or more sub-images, to acquire the detected human head positions in the image comprises:
screening, by the electronic device from the human head positions corresponding to the one or more sub-images, to acquire screened human head positions corresponding to confidence levels greater than or equal to a predetermined confidence level threshold;
selecting, by the electronic device, human head positions intersecting with the screened human head positions from the screened human head positions to obtain overlapped human head positions; and
determining, by the electronic device according to the screened human head positions and the overlapped human head positions, the detected human head positions of the image.
6. The method according to claim 5, wherein determining, by the electronic device according to the screened human head positions and the overlapped human head positions, the detected human head positions of the image comprises:
using, by the electronic device, the screened human head positions and the overlapped human head positions as nodes in a bipartite graph;
assigning, by the electronic device, default and positive weights to edges between the nodes in the bipartite graph;
reducing, by the electronic device, weights of edges in the bipartite graph associated with the overlapped human head positions; and
solving, by the electronic device, a maximum weight edge combination of the bipartite graph to obtain the detected human head positions of the image.
7. The method according to claim 1, wherein the image comprises a video frame in a video, and the method further comprises:
performing, by the electronic device, human head tracking according to the detected human head positions video frame by video frame;
determining, by the electronic device, a moving direction and a positional relationship of each of the tracked human head positions relative to a designated area; and
performing, by the electronic device, people counting according to the moving direction and positional relationship of each of the tracked human head positions.
8. The method according to claim 7, wherein determining, by the electronic device, the moving direction and the positional relationship of the tracked human head position relative to the designated area comprises:
determining, by the electronic device, that the tracked human head position enters the designated area when the tracked human head position sequentially crosses a first line and a second line parallel with the first line; and
determining, by the electronic device, that the tracked human head position leaves the designated area when the tracked human head position sequentially crosses the second line and the first line.
9. The method according to claim 7, wherein the method further comprises:
tracking and recording, by the electronic device, the detected human head positions video frame by video frame;
acquiring, by the electronic device, a human head position tracked in a previous video frame if the tracking of the human head position in a current video frame is interrupted;
detecting, by the electronic device, a recovered human head position in the current video frame within a local area covering the acquired human head position in the previous video frame; and
continuing, by the electronic device, tracking and recording the recovered human head position video frame by video frame.
10. An electronic device for detecting a human head in an image, comprising a memory and a processor, the memory storing computer readable instructions, the computer readable instructions, when executed by the processor, causing the processor to perform the following steps:
segmenting the image into one or more sub-images;
inputting each sub-image to a convolutional neural network trained according to training images having marked human head positions, and outputting, by a preprocessing layer of the convolutional neural network comprising a first convolutional layer and a pooling layer, a first feature corresponding to each sub-image;
mapping, through a second convolutional layer after the preprocessing layer in the convolutional neural network, the first feature corresponding to each sub-image to a second feature corresponding to each sub-image;
mapping, through a regression layer in the convolutional neural network, the second feature corresponding to each sub-image to a human head position corresponding to each sub-image and a corresponding confidence level of the human head position; and
filtering, according to the corresponding confidence level, human head positions corresponding to the one or more sub-images, to acquire detected human head positions in the image.
11. The electronic device according to claim 10, wherein segmenting the image into one or more sub-images comprises:
segmenting the image into one or more sub-images of a fixed size, wherein adjacent sub-images in the one or more sub-images partially overlap.
12. The electronic device according to claim 10, wherein:
a fully connected layer in a conventional convolution neural network is converted to the second convolutional layer;
a conventional regression layer in a conventional convolutional neural network for image classification is replaced by the regression layer for mapping the second feature output by the second convolutional layer to the human head position and the corresponding confidence level; and
the computer readable instructions cause the processor to further perform the step of training, by the electronic device, the convolutional neural network comprising the preprocessing layer, the second convolutional layer, and the regression layer by using the training images having the marked human head positions.
13. The electronic device according to claim 10, wherein mapping, through a regression layer in the convolutional neural network, the second feature corresponding to each sub-image to a human head position corresponding to each sub-image and a corresponding confidence level of the human head position comprises:
mapping, through a third convolutional layer in the regression layer of the convolutional neural network, the second feature corresponding to each sub-image to the human head position corresponding to each sub-image; and
mapping, through a fourth convolutional layer in the regression layer of the convolutional neural network, the second feature corresponding to each sub-image to the confidence level corresponding to the human head position.
14. The electronic device according to claim 10, wherein filtering, according to the corresponding confidence level, the human head positions corresponding to the one or more sub-images, to acquire the detected human head positions in the image comprises:
screening, from the human head positions corresponding to the one or more sub-images, to acquire screened human head positions corresponding to confidence levels greater than or equal to a predetermined confidence level threshold;
selecting human head positions intersecting with the screened human head positions from the screened human head positions to obtain overlapped human head positions; and
determining, according to the screened human head positions and the overlapped human head positions, the detected human head positions.
15. The electronic device according to claim 14, wherein determining, according to the screened human head positions and the overlapped human head positions, the detected human head positions in the image comprises:
using the screened human head positions and the overlapped human head positions as nodes in a bipartite graph;
assigning default and positive weights to edges between the nodes in the bipartite graph;
reducing weights of edges in the bipartite graph associated with the overlapped human head positions; and
solving a maximum weight edge combination of the bipartite graph to obtain the detected human head positions in the image.
16. The electronic device according to claim 10, wherein the image comprises a video frame in a video; and the computer readable instructions further cause the processor to perform the following steps:
performing human head tracking according to the detected human head positions video frame by video frame;
determining a moving direction and a positional relationship of each of the tracked human head positions relative to a designated area; and
performing people counting according to the moving direction and positional relationship of each of the tracked human head positions.
17. The electronic device according to claim 16, wherein determining the moving direction and the positional relationship of the tracked human head position relative to the designated area comprises:
determining that the tracked human head position enters the designated area when the tracked human head position sequentially crosses a first line and a second line parallel with the first line; and
determining that the tracked human head position leaves the designated area when the tracked human head position sequentially crosses the second line and the first line.
18. The electronic device according to claim 16, wherein the computer readable instructions further cause the processor to perform the following steps:
tracking and recording the detected human head positions video frame by video frame;
acquiring a human head position tracked in a previous video frame if the tracking of the human head position in a current video frame is interrupted;
detecting a recovered human head position in the current video frame within a local area covering the acquired human head position in the previous video frame; and
continuing tracking and recording the recovered human head position video frame by video frame.
19. A non-volatile storage medium for storing computer readable instructions, the computer readable instructions, when executed by one or more processors, causing the one or more processors to perform human head detection in an image by the following steps:
segmenting the image into one or more sub-images;
inputting each sub-image to a convolutional neural network trained according to training images having marked human head positions, and outputting, by a preprocessing layer of the convolutional neural network comprising a first convolutional layer and a pooling layer, a first feature corresponding to each sub-image;
mapping, through a second convolutional layer after the preprocessing layer in the convolutional neural network, the first feature corresponding to each sub-image to a second feature corresponding to each sub-image;
mapping, through a regression layer in the convolutional neural network, the second feature corresponding to each sub-image to a human head position corresponding to each sub-image and a corresponding confidence level of the human head position; and
filtering, according to the corresponding confidence level, human head positions corresponding to the one or more sub-images, to acquire detected human head positions in the image.
20. The non-volatile storage medium according to claim 19, wherein segmenting the image into one or more sub-images comprises:
segmenting the image into one or more sub-images of a fixed size, wherein adjacent sub-images in the one or more sub-images partially overlap.
US16/351,093 2017-01-16 2019-03-12 Human head detection method, eletronic device and storage medium Abandoned US20190206085A1 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN201710029244.6 2017-01-16
CN201710029244.6A CN106845383B (en) 2017-01-16 2017-01-16 Human head detection method and device
PCT/CN2018/070008 WO2018130104A1 (en) 2017-01-16 2018-01-02 Human head detection method, electronic device and storage medium

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/070008 Continuation WO2018130104A1 (en) 2017-01-16 2018-01-02 Human head detection method, electronic device and storage medium

Publications (1)

Publication Number Publication Date
US20190206085A1 true US20190206085A1 (en) 2019-07-04

Family

ID=59123959

Family Applications (2)

Application Number Title Priority Date Filing Date
US16/299,866 Active 2038-03-13 US10796450B2 (en) 2017-01-16 2019-03-12 Human head detection method, eletronic device and storage medium
US16/351,093 Abandoned US20190206085A1 (en) 2017-01-16 2019-03-12 Human head detection method, eletronic device and storage medium

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US16/299,866 Active 2038-03-13 US10796450B2 (en) 2017-01-16 2019-03-12 Human head detection method, eletronic device and storage medium

Country Status (4)

Country Link
US (2) US10796450B2 (en)
EP (1) EP3570209A4 (en)
CN (1) CN106845383B (en)
WO (1) WO2018130104A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10824907B2 (en) 2017-12-07 2020-11-03 Shanghai United Imaging Healthcare Co., Ltd. Systems and methods for image processing
US11017241B2 (en) * 2018-12-07 2021-05-25 National Chiao Tung University People-flow analysis system and people-flow analysis method
US11048948B2 (en) * 2019-06-10 2021-06-29 City University Of Hong Kong System and method for counting objects
US20220067391A1 (en) * 2020-09-03 2022-03-03 Industrial Technology Research Institute System, method and storage medium for detecting people entering and leaving a field
US20220180098A1 (en) * 2020-12-07 2022-06-09 Canon Kabushiki Kaisha Image processing apparatus, image processing method, and storage medium
US20220319168A1 (en) * 2019-08-07 2022-10-06 Zte Corporation Method for estimating and presenting passenger flow, system, and computer readable storage medium

Families Citing this family (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106845383B (en) 2017-01-16 2023-06-06 腾讯科技(上海)有限公司 Human head detection method and device
US10019654B1 (en) * 2017-06-28 2018-07-10 Accenture Global Solutions Limited Image object recognition
CN107886098A (en) * 2017-10-25 2018-04-06 昆明理工大学 A kind of method of the identification sunspot based on deep learning
CN107832807B (en) * 2017-12-07 2020-08-07 上海联影医疗科技有限公司 Image processing method and system
CN108073898B (en) * 2017-12-08 2022-11-18 腾讯科技(深圳)有限公司 Method, device and equipment for identifying human head area
CN108154110B (en) * 2017-12-22 2022-01-11 任俊芬 Intensive people flow statistical method based on deep learning people head detection
CN108090454A (en) * 2017-12-26 2018-05-29 上海理工大学 Campus bathhouse people flow rate statistical system
CN108345832A (en) * 2017-12-28 2018-07-31 新智数字科技有限公司 A kind of method, apparatus and equipment of Face datection
CN108198191B (en) * 2018-01-02 2019-10-25 武汉斗鱼网络科技有限公司 Image processing method and device
CN108154196B (en) * 2018-01-19 2019-10-22 百度在线网络技术(北京)有限公司 Method and apparatus for exporting image
CN108881740B (en) * 2018-06-28 2021-03-02 Oppo广东移动通信有限公司 Image method and device, electronic equipment and computer readable storage medium
CN109241871A (en) * 2018-08-16 2019-01-18 北京此时此地信息科技有限公司 A kind of public domain stream of people's tracking based on video data
CN109816011B (en) * 2019-01-21 2021-09-07 厦门美图之家科技有限公司 Video key frame extraction method
CN211669666U (en) * 2019-01-29 2020-10-13 王馨悦 Passenger flow counter
US11182903B2 (en) 2019-08-05 2021-11-23 Sony Corporation Image mask generation using a deep neural network
CN110688914A (en) * 2019-09-09 2020-01-14 苏州臻迪智能科技有限公司 Gesture recognition method, intelligent device, storage medium and electronic device
CN111008631B (en) * 2019-12-20 2023-06-16 浙江大华技术股份有限公司 Image association method and device, storage medium and electronic device
CN111680569B (en) * 2020-05-13 2024-04-19 北京中广上洋科技股份有限公司 Attendance rate detection method, device, equipment and storage medium based on image analysis
CN111915779B (en) * 2020-07-31 2022-04-15 浙江大华技术股份有限公司 Gate control method, device, equipment and medium
CN112364716A (en) * 2020-10-23 2021-02-12 岭东核电有限公司 Nuclear power equipment abnormal information detection method and device and computer equipment
CN113011297A (en) * 2021-03-09 2021-06-22 全球能源互联网研究院有限公司 Power equipment detection method, device, equipment and server based on edge cloud cooperation
CN115082836B (en) * 2022-07-23 2022-11-11 深圳神目信息技术有限公司 Behavior recognition-assisted target object detection method and device
US11983337B1 (en) 2022-10-28 2024-05-14 Dell Products L.P. Information handling system mouse with strain sensor for click and continuous analog input
US11983061B1 (en) 2022-10-28 2024-05-14 Dell Products L.P. Information handling system peripheral device sleep power management
US11914800B1 (en) 2022-10-28 2024-02-27 Dell Products L.P. Information handling system stylus with expansion bay and replaceable module
CN115797341B (en) * 2023-01-16 2023-04-14 四川大学 Method for automatically and immediately judging natural head position of skull side position X-ray film
CN116245911B (en) * 2023-02-08 2023-11-03 珠海安联锐视科技股份有限公司 Video offline statistics method

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050147292A1 (en) * 2000-03-27 2005-07-07 Microsoft Corporation Pose-invariant face recognition system and process
US20050286753A1 (en) * 2004-06-25 2005-12-29 Triant Technologies Inc. Automated inspection systems and methods
US20140307076A1 (en) * 2013-10-03 2014-10-16 Richard Deutsch Systems and methods for monitoring personal protection equipment and promoting worker safety
US20170351936A1 (en) * 2014-12-17 2017-12-07 Nokia Technologies Oy Object detection with neural network
US20190325605A1 (en) * 2016-12-29 2019-10-24 Zhejiang Dahua Technology Co., Ltd. Systems and methods for detecting objects in images
US10607331B1 (en) * 2019-06-28 2020-03-31 Corning Incorporated Image segmentation into overlapping tiles
US20200110930A1 (en) * 2017-11-13 2020-04-09 Way2Vat Ltd. Systems and methods for neuronal visual-linguistic data retrieval from an imaged document
US20200167593A1 (en) * 2018-11-27 2020-05-28 Raytheon Company Dynamic reconfiguration training computer architecture

Family Cites Families (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2409028A (en) * 2003-12-11 2005-06-15 Sony Uk Ltd Face detection
US20100014755A1 (en) * 2008-07-21 2010-01-21 Charles Lee Wilson System and method for grid-based image segmentation and matching
US8582807B2 (en) * 2010-03-15 2013-11-12 Nec Laboratories America, Inc. Systems and methods for determining personal characteristics
CN102156863B (en) * 2011-05-16 2012-11-14 天津大学 Cross-camera tracking method for multiple moving targets
CN102902967B (en) * 2012-10-16 2015-03-11 第三眼(天津)生物识别科技有限公司 Method for positioning iris and pupil based on eye structure classification
CN103559478B (en) * 2013-10-07 2018-12-04 唐春晖 Overlook the passenger flow counting and affair analytical method in pedestrian's video monitoring
US9524450B2 (en) * 2015-03-04 2016-12-20 Accenture Global Services Limited Digital image processing using convolutional neural networks
US10074041B2 (en) * 2015-04-17 2018-09-11 Nec Corporation Fine-grained image classification by exploring bipartite-graph labels
CN104922167B (en) 2015-07-16 2018-09-18 袁学军 A kind of Reishi sporule medicinal granules of powder and preparation method thereof
CN104992167B (en) * 2015-07-28 2018-09-11 中国科学院自动化研究所 A kind of method for detecting human face and device based on convolutional neural networks
CN105005774B (en) * 2015-07-28 2019-02-19 中国科学院自动化研究所 A kind of recognition methods of face kinship and device based on convolutional neural networks
CN105374050B (en) * 2015-10-12 2019-10-18 浙江宇视科技有限公司 Motion target tracking restoration methods and device
CN105608690B (en) * 2015-12-05 2018-06-08 陕西师范大学 A kind of image partition method being combined based on graph theory and semi-supervised learning
CN105740758A (en) * 2015-12-31 2016-07-06 上海极链网络科技有限公司 Internet video face recognition method based on deep learning
CN106022237B (en) * 2016-05-13 2019-07-12 电子科技大学 A kind of pedestrian detection method of convolutional neural networks end to end
CN106022295B (en) * 2016-05-31 2019-04-12 北京奇艺世纪科技有限公司 A kind of determination method and device of Data Position
CN106845383B (en) * 2017-01-16 2023-06-06 腾讯科技(上海)有限公司 Human head detection method and device


Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10824907B2 (en) 2017-12-07 2020-11-03 Shanghai United Imaging Healthcare Co., Ltd. Systems and methods for image processing
US11416706B2 (en) 2017-12-07 2022-08-16 Shanghai United Imaging Healthcare Co., Ltd. Systems and methods for image processing
US11017241B2 (en) * 2018-12-07 2021-05-25 National Chiao Tung University People-flow analysis system and people-flow analysis method
US11048948B2 (en) * 2019-06-10 2021-06-29 City University Of Hong Kong System and method for counting objects
US20220319168A1 (en) * 2019-08-07 2022-10-06 Zte Corporation Method for estimating and presenting passenger flow, system, and computer readable storage medium
US11816875B2 (en) * 2019-08-07 2023-11-14 Xi'an Zhongxing New Software Co., Ltd. Method for estimating and presenting passenger flow, system, and computer readable storage medium
US20220067391A1 (en) * 2020-09-03 2022-03-03 Industrial Technology Research Institute System, method and storage medium for detecting people entering and leaving a field
US11587325B2 (en) * 2020-09-03 2023-02-21 Industrial Technology Research Institute System, method and storage medium for detecting people entering and leaving a field
US20220180098A1 (en) * 2020-12-07 2022-06-09 Canon Kabushiki Kaisha Image processing apparatus, image processing method, and storage medium

Also Published As

Publication number Publication date
CN106845383B (en) 2023-06-06
EP3570209A1 (en) 2019-11-20
US20190206083A1 (en) 2019-07-04
WO2018130104A1 (en) 2018-07-19
EP3570209A4 (en) 2020-12-23
CN106845383A (en) 2017-06-13
US10796450B2 (en) 2020-10-06

Similar Documents

Publication Publication Date Title
US10796450B2 (en) Human head detection method, eletronic device and storage medium
US10860837B2 (en) Deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition
TWI677826B (en) License plate recognition system and method
US10467458B2 (en) Joint face-detection and head-pose-angle-estimation using small-scale convolutional neural network (CNN) modules for embedded systems
CN107506707B (en) Face detection using small scale convolutional neural network module in embedded system
CN108304798B (en) Street level order event video detection method based on deep learning and motion consistency
RU2730687C1 (en) Stereoscopic pedestrian detection system with two-stream neural network with deep training and methods of application thereof
US20180114071A1 (en) Method for analysing media content
US9128528B2 (en) Image-based real-time gesture recognition
CN113286194A (en) Video processing method and device, electronic equipment and readable storage medium
CN112560796B (en) Human body posture real-time detection method and device, computer equipment and storage medium
CN111914698B (en) Human body segmentation method, segmentation system, electronic equipment and storage medium in image
CN110163211B (en) Image recognition method, device and storage medium
CN108875750B (en) Object detection method, device and system and storage medium
WO2022252642A1 (en) Behavior posture detection method and apparatus based on video image, and device and medium
CN111476710A (en) Video face changing method and system based on mobile platform
CA3136674C (en) Methods and systems for crack detection using a fully convolutional network
Liu et al. Extended faster R-CNN for long distance human detection: Finding pedestrians in UAV images
CN113297956A (en) Gesture recognition method and system based on vision
CN110334568B (en) Track generation and monitoring method, device, equipment and storage medium
Moseva et al. Development of a System for Fixing Road Markings in Real Time
CN111915713A (en) Three-dimensional dynamic scene creating method, computer equipment and storage medium
CN112966556B (en) Moving object detection method and system
CN114332163A (en) High-altitude parabolic detection method and system based on semantic segmentation
Liang et al. Towards better railway service: Passengers counting in railway compartment

Legal Events

Date Code Title Description
AS Assignment

Owner name: TENCENT TECHNOLOGY (SHENZHEN) COMPANY LIMITED, CHI

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:JIANG, DEQIANG;REEL/FRAME:048583/0838

Effective date: 20190308

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION