CN110826469A - Person detection method and device and computer readable storage medium

Info

Publication number
CN110826469A
CN110826469A (application CN201911059463.4A)
Authority
CN
China
Prior art keywords
person
network
image
age
input image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911059463.4A
Other languages
Chinese (zh)
Other versions
CN110826469B (en)
Inventor
贾玉虎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN201911059463.4A
Publication of CN110826469A
Application granted
Publication of CN110826469B
Current legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G06V40/172 Classification, e.g. identification
    • G06V40/178 Estimating age from face image; using age information for improving recognition
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

A person detection method, apparatus, and computer-readable storage medium detect the age group in which a person object in an input image is located based on an age detection model, and detect the person category and person position corresponding to the person object in the input image based on a target detection model; when the person category determined from the age group successfully matches the person category determined by the target detection model, a person detection result is generated based on the person category and the person position. Through the implementation of this scheme, person detection is realized by integrating the apparent detection result of age detection with the global detection result of target detection, which effectively avoids the false detection and missed detection that occur when a single model is used and improves the accuracy of the detection result.

Description

Person detection method and device and computer readable storage medium
Technical Field
The present application relates to the field of electronic technologies, and in particular, to a person detection method and apparatus, and a computer-readable storage medium.
Background
With the rapid development of computer technology, person detection is increasingly widely applied; for example, persons in an image can be detected for target tracking or for focusing during shooting.
Currently, when persons are detected, a person's age is estimated from facial features, enabling the detection of person categories at specific age stages (infants, juveniles, middle-aged, and elderly people). However, the apparent age of a face cannot absolutely and truly reflect the observed person's actual age, so when facial features have low saliency and are easily confused, the current age estimation approach is prone to false detection and missed detection, which limits the accuracy of person detection.
Disclosure of Invention
The embodiments of the present application provide a person detection method and apparatus and a computer-readable storage medium, which can at least solve the problems in the related art of high false detection and missed detection rates and limited accuracy of detection results when an age estimation method is used for person detection.
A first aspect of an embodiment of the present application provides a human detection method, including:
determining a first input image which needs to be input into a trained age detection model based on an image to be detected, and determining a second input image which needs to be input into a trained target detection model based on the image to be detected;
detecting an age group in which a person object is located in the first input image based on the age detection model, and detecting a person category and a person position corresponding to the person object in the second input image based on the target detection model; wherein the person categories are divided according to different age groups;
when the person category determined based on the age group successfully matches the person category determined based on the target detection model, generating a person detection result based on the person category and the person position.
A second aspect of the embodiments of the present application provides a person detection apparatus, including:
the determining module is used for determining a first input image which needs to be input into a trained age detection model based on an image to be detected and determining a second input image which needs to be input into a trained target detection model based on the image to be detected;
a detection module, configured to detect an age group in which the person object is located in the first input image based on the age detection model, and detect a person category and a person position corresponding to the person object in the second input image based on the target detection model; wherein the person categories are divided according to different age groups;
and the generating module is used for generating a person detection result based on the person category and the person position when the person category determined based on the age group is successfully matched with the person category determined based on the target detection model.
A third aspect of embodiments of the present application provides an electronic apparatus, including: the system comprises a memory, a processor and a bus, wherein the bus is used for realizing the connection and communication between the memory and the processor; the processor is configured to execute the computer program stored in the memory, and when the processor executes the computer program, the processor implements the steps of the human detection method provided by the first aspect of the embodiment of the present application.
A fourth aspect of the present embodiment provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps in the human detection method provided in the first aspect of the present embodiment.
As can be seen from the above, the person detection method, apparatus, and computer-readable storage medium provided by the present application detect the age group in which a person object in an input image is located based on an age detection model, and detect the person category and person position corresponding to the person object in the input image based on a target detection model; when the person category determined from the age group successfully matches the person category determined by the target detection model, a person detection result is generated based on the person category and the person position. Through the implementation of this scheme, person detection is realized by integrating the apparent detection result of age detection with the global detection result of target detection, which effectively avoids the false detection and missed detection that occur with single-model person detection and improves the accuracy of the detection result.
Drawings
Fig. 1 is a schematic basic flowchart of a person detection method according to a first embodiment of the present application;
fig. 2 is a schematic flowchart of an age detection method according to a first embodiment of the present application;
fig. 3 is a schematic structural diagram of an age detection model according to a first embodiment of the present application;
fig. 4 is a schematic flowchart of a target detection method according to a first embodiment of the present application;
FIG. 5 is a schematic diagram of an architecture of a target detection model according to a first embodiment of the present application;
fig. 6 is a schematic flow chart of a feature extraction method according to a first embodiment of the present application;
fig. 7 is a detailed flowchart of a person detection method according to a second embodiment of the present application;
fig. 8 is a schematic diagram illustrating program modules of a person detection apparatus according to a third embodiment of the present application;
fig. 9 is a schematic diagram illustrating program modules of another person detection apparatus according to a third embodiment of the present application;
fig. 10 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present application.
Detailed Description
In order to make the objects, features and advantages of the present invention more apparent and understandable, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In order to overcome the defects in the related art of high false detection and missed detection rates and limited accuracy of detection results when a person is detected using an age estimation method, a first embodiment of the present application provides a person detection method. Fig. 1 is a basic flowchart of the person detection method provided in this embodiment, and the method includes the following steps:
step 101, determining a first input image required to be input into a trained age detection model based on an image to be detected, and determining a second input image required to be input into a trained target detection model based on the image to be detected.
Specifically, in consideration of the limitation of a single age detection model in performing person detection, this embodiment employs an integrated model including an age detection model and a target detection model to realize person detection. It should be noted that the image to be detected is simply the original image on which detection needs to be performed, and the inputs of the two models of this embodiment are determined based on the image to be detected. In one implementation, the inputs of the two models may be the image to be detected itself; in another implementation, the inputs may be images obtained by processing the image to be detected.
In an optional implementation manner of this embodiment, determining a first input image required to be input to the trained age detection model based on the image to be detected, and determining a second input image required to be input to the trained target detection model based on the image to be detected includes: determining a face image of a person object in an image to be detected as a first input image which needs to be input into a trained age detection model, and determining a global image of the image to be detected as a second input image which needs to be input into a trained target detection model.
Specifically, in this embodiment, since the age detection is implemented based on face detection, the input of the age detection model is only the face image of the person object in the image to be detected, and the input of the target detection model is the entire image to be detected. In other embodiments, the input of the age detection model may also be the image to be detected, and the face image is extracted through the age detection model.
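As an illustration of this input-preparation step, the following Python sketch shows one way the two model inputs could be derived from a single image to be detected; the helper detect_face_box stands in for an arbitrary face detector and is hypothetical, since the embodiment does not prescribe a specific one.

```python
# Minimal sketch of step 101 under the stated assumptions: the age model
# receives a face crop, the target model receives the global image.
# `detect_face_box` is a hypothetical stand-in for any face detector.
import numpy as np

def prepare_model_inputs(image_to_detect: np.ndarray, detect_face_box):
    x0, y0, x1, y1 = detect_face_box(image_to_detect)   # face bounding box
    first_input = image_to_detect[y0:y1, x0:x1]          # -> age detection model
    second_input = image_to_detect                       # -> target detection model
    return first_input, second_input
```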
In an optional implementation manner of this embodiment, before determining a first input image required to be input to the trained age detection model based on the image to be detected, and determining a second input image required to be input to the trained target detection model based on the image to be detected, the method further includes: judging whether the image to be detected meets a preset person detection triggering condition or not; and when the person detection triggering condition is met, determining a first input image which needs to be input into the trained age detection model based on the image to be detected, and determining a second input image which needs to be input into the trained target detection model based on the image to be detected.
Specifically, this embodiment triggers the execution of the person detection process only in suitable scenes, and the manner of determining whether the person detection triggering condition is satisfied includes, but is not limited to, the following two:
the method comprises the steps of judging whether the image to be detected comprises a face image or not.
In practical application, only when the image to be detected includes a face image does a person appear in the image and a corresponding person detection requirement exist; checking this first effectively avoids wasting terminal processing performance by blindly triggering the person detection process of this embodiment.
Manner 2: judging whether the facial feature saliency is lower than a preset threshold.
In practical applications, the less salient the facial features, the more easily they are confused; for example, the facial features of a baby aged 0 to 1 are similar to those of a young child aged 2 to 3. This embodiment extracts a face image of the person object in the image to be detected, determines the facial feature saliency of the face image, and triggers the person detection process of this embodiment when the saliency is lower than a preset threshold. Executing the person detection process at low facial feature saliency guarantees the accuracy of person detection; when the facial feature saliency is high, the facial features alone can largely guarantee accurate age detection, so in some implementations the image to be detected is input only to the age detection model of this embodiment for single-model person detection, which improves detection efficiency and reduces processing power consumption while maintaining accuracy.
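The trigger logic just described can be summarized in a short sketch; find_faces and feature_saliency are hypothetical helpers, and the 0.5 value is only a placeholder for the preset saliency threshold.

```python
# A hedged sketch of the two trigger conditions: a face must be present
# (manner 1) and its facial feature saliency must fall below a preset
# threshold (manner 2) before the dual-model pipeline is run.
def should_trigger_person_detection(image, find_faces, feature_saliency,
                                    saliency_threshold: float = 0.5) -> bool:
    faces = find_faces(image)
    if not faces:                        # manner 1: no face, no person to detect
        return False
    # manner 2: low saliency -> facial features alone are unreliable,
    # so the integrated (age + target) detection process is triggered
    return min(feature_saliency(face) for face in faces) < saliency_threshold
```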
Step 102, detecting an age group in which the person object is located in the first input image based on the age detection model, and detecting a person category and a person position corresponding to the person object in the second input image based on the target detection model.
Specifically, the person categories of this embodiment are divided according to different age groups; for example, the age group of 0 to 1 year corresponds to the category infant, 1 to 6 years to young child, 7 to 12 years to juvenile, and 13 to 17 years to adolescent. In addition, it should be understood that the person position in this embodiment may be the position of the whole person, the position of a part of the person (for example, the face), or several pieces of position information covering both the whole-body position and part positions.
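For concreteness, the age-group-to-category mapping enumerated above can be written as a small lookup table; the boundaries below are the embodiment's example values, and the adult fallback for unlisted ages is an assumption added for illustration.

```python
# Example person-category table following the age groups listed above.
AGE_GROUP_TO_CATEGORY = [
    ((0, 1), "infant"),
    ((1, 6), "young child"),
    ((7, 12), "juvenile"),
    ((13, 17), "adolescent"),
]

def category_for_age_group(age: float) -> str:
    for (low, high), category in AGE_GROUP_TO_CATEGORY:
        if low <= age <= high:
            return category
    return "adult"  # fallback for ages outside the listed groups (assumption)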
Fig. 2 is a schematic flowchart of an age detection method provided in this embodiment. In an optional implementation manner of this embodiment, the age detection model includes a first network branch and a second network branch, which are parallel heterogeneous networks having the same number of network parameters but different activation functions in their feature extraction layers. Detecting the age group in which the person object is located in the first input image based on the age detection model then specifically includes the following steps:
step 201, inputting a first input image to a first network branch and a second network branch simultaneously;
step 202, fusing the feature outputs of the network segments of the first network branch and the second network branch to obtain the segment output corresponding to each network segment;
and step 203, performing dynamic soft interval regression based on all the segment outputs to obtain the age group in which the person object is located in the first input image.
Specifically, the network architectures adopted by current age estimation methods are too complex and their model parameters too numerous, resulting in low person detection efficiency and high power consumption. The age detection model of this embodiment is designed to address this. Each of the first network branch and the second network branch includes a plurality of network segments for performing feature extraction of different age groups, and the segment output includes: an interval prediction distribution (i.e., the probability of the interval corresponding to the segment), an interval translation coefficient, and an interval width scaling coefficient. It should also be noted that the training data of the age detection model of this embodiment can be set to four groups (0 to 1 year old, 1 to 3 years old, 3 to 10 years old, and 10 years old and above); this finer classification helps the model learn finer features and performs much better than a binary classification. Fig. 3 is a schematic diagram of the age detection model provided in this embodiment, in which dashed box A is the first network branch and dashed box B is the second network branch. The model uses two network segments, each of which performs feature fusion and produces output in two directions: the two outputs of the first network segment are a1 and b1, where a1 is the interval width scaling coefficient corresponding to the first network segment and b1 includes its interval prediction distribution and interval translation coefficient; likewise, a2 is the interval width scaling coefficient corresponding to the second network segment and b2 includes its interval prediction distribution and interval translation coefficient. The PB (prediction block) blocks in fig. 3 represent prediction modules.
The age detection model of this embodiment converts age prediction from a regression problem into a multi-classification problem and adopts a two-stream network with multiple segments and a very compact design; the two streams are parallel heterogeneous networks used to extract heterogeneous features. Each network branch is segmented across multiple layers, following a coarse-to-fine strategy in which each segment performs a partial age classification; the per-segment task is small, few neurons are needed, and fewer parameters are generated, so the model is compact and lightweight and easier to deploy on embedded devices such as mobile phones. It should be appreciated that uniformly dividing the age range into non-overlapping intervals is not flexible enough to handle age-group imbalance and age continuity; this embodiment therefore adopts soft classification and introduces dynamic ranges, allowing each interval to be translated and scaled, with translation and scaling parameters taking adaptive, input-dependent values.
In addition, the two network branches of the age detection model extract different features through different activation functions and pooling modes, improving the richness of the fused features. Within each segment, the features from the two branches are sent to a fusion block, which is responsible for generating the segment output. In this embodiment, the two branches may be fused by dot product (also called element-wise multiplication), i.e., multiplying the corresponding elements of the two vectors. The network uses different activation functions to keep each output in its correct range: the ReLU activation function ensures the distribution probabilities are positive, and the Tanh activation function confines the translation and scaling coefficients to the range [-1, 1]. The network's final loss function is the mean absolute error between the predicted output and the ground truth. It should further be noted that, in practical applications, the number of channels of the first network branch of this embodiment may be 32 and that of the second network branch may be 16.
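A compact PyTorch sketch of one segment's fusion block follows, under the assumptions stated above: features from the two branches are fused by element-wise multiplication, a ReLU head keeps the interval prediction distribution non-negative, and a Tanh head confines the translation and width-scaling coefficients to [-1, 1]. The channel widths (32 and 16) follow the figures given above; the soft_interval_age function is one plausible reading of the dynamic soft interval regression, not the patent's exact formula.

```python
import torch
import torch.nn as nn

class SegmentFusionBlock(nn.Module):
    """Fuses features from the two heterogeneous branches for one segment."""
    def __init__(self, ch_first: int = 32, ch_second: int = 16, num_intervals: int = 4):
        super().__init__()
        self.align = nn.Linear(ch_second, ch_first)  # match widths before the dot product
        self.prob_head = nn.Sequential(nn.Linear(ch_first, num_intervals), nn.ReLU())
        self.coef_head = nn.Sequential(nn.Linear(ch_first, 2 * num_intervals), nn.Tanh())

    def forward(self, feat_first, feat_second):
        fused = feat_first * self.align(feat_second)            # element-wise multiplication
        probs = self.prob_head(fused)                           # interval prediction distribution (>= 0)
        shift, scale = self.coef_head(fused).chunk(2, dim=-1)   # both confined to [-1, 1]
        return probs, shift, scale

def soft_interval_age(probs, shift, scale, centers, widths):
    """Expected age over translated and scaled intervals (one plausible reading)."""
    probs = probs / probs.sum(dim=-1, keepdim=True).clamp_min(1e-8)
    adjusted_centers = centers + shift * widths * (1.0 + scale)  # per-input dynamic intervals
    return (probs * adjusted_centers).sum(dim=-1)
```

Training could then minimize the mean absolute error between soft_interval_age(...) and the ground-truth age, matching the loss described above.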
Further, in an optional implementation manner of this embodiment, before the first input image is simultaneously input to the first network branch and the second network branch, the method further includes: acquiring coarse-grained characteristic information of a human object in a first input image; configuring a plurality of network segments of the first network branch and the second network branch based on the coarse-grained characteristic information.
Specifically, in this embodiment, rough feature extraction may be performed on the person object in the first input image in advance, and the coarsely extracted features are then used as a reference to configure the network segments in real time, so that the segmentation of the current network better matches the actual application scene and is better suited to extracting features from the current input image, achieving a compromise between the richness of the extracted features and the feature extraction efficiency.
As shown in fig. 4, which is a schematic flow chart of a target detection method provided in this embodiment, in an optional implementation manner of this embodiment, the target detection model includes a feature extraction network and a target detection network, so that when detecting a person category and a person position corresponding to a person object in a second input image based on the target detection model, the method specifically includes the following steps:
step 401, inputting a second input image into a feature extraction network to perform feature extraction layer by layer to obtain a multi-channel feature map;
step 402, extracting features of the multi-channel feature map at different scales through the target detection network to obtain feature information of different scales;
and step 403, predicting, through the target detection network, the person category and the person position corresponding to the person object in the second input image by adopting prior boxes of different scales and aspect ratios, based on the extracted feature information of different scales.
Specifically, fig. 5 is an architecture diagram of the target detection model provided in this embodiment, where a is the feature extraction network, b is the target detection network, and the final output includes the predicted person category and person position. This embodiment acquires feature information from feature maps of different scales and predicts the target's position and category from them, making the network simultaneously sensitive to both large and small objects in the input image. It should be noted that the target detection network of this embodiment is a one-stage target detector, which is both fast and accurate: on one hand, it uses different convolutional layers to extract feature maps of different scales for detection, with large-scale feature maps (from earlier layers) used to detect small objects and small-scale feature maps (from later layers) used to detect large objects; on the other hand, it employs prior boxes of different scales and aspect ratios.
In the prediction stage of the target detection model in this embodiment, for each prior box, the category (the one with the largest confidence) and its confidence value are determined from the class confidences, and prior boxes belonging to the background are filtered out. Prior boxes below a confidence threshold (e.g., 0.5) are then filtered out. The remaining prior boxes are decoded to obtain their real position parameters. After decoding, they are generally sorted in descending order of confidence and only the top k (e.g., 400) are retained. Finally, a non-maximum suppression algorithm filters out prior boxes with large overlap, and the remaining prior boxes constitute the final prediction result.
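The post-processing chain described above can be sketched with torchvision's NMS; the 0.5 confidence threshold and top-400 cut follow the example values in the text, while the 0.45 IoU threshold and the class-0-is-background convention are assumptions.

```python
import torch
from torchvision.ops import nms

def postprocess(decoded_boxes, class_scores, conf_thresh=0.5, top_k=400, iou_thresh=0.45):
    """decoded_boxes: (P, 4) prior boxes after decoding; class_scores: (P, C), class 0 = background."""
    scores, labels = class_scores.max(dim=1)         # most confident class per prior box
    keep = (labels != 0) & (scores > conf_thresh)    # drop background and low-confidence boxes
    boxes, scores, labels = decoded_boxes[keep], scores[keep], labels[keep]
    order = scores.argsort(descending=True)[:top_k]  # keep only the top-k by confidence
    boxes, scores, labels = boxes[order], scores[order], labels[order]
    keep = nms(boxes, scores, iou_thresh)            # suppress heavily overlapping boxes
    return boxes[keep], scores[keep], labels[keep]   # final prediction result
```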
Fig. 6 is a schematic flow diagram of a feature extraction method provided in this embodiment. In an optional implementation manner of this embodiment, the feature extraction network includes a depthwise convolution network and a pointwise convolution network, so that inputting the second input image into the feature extraction network for layer-by-layer feature extraction to obtain a multi-channel feature map specifically includes the following steps:
step 601, inputting the second input image into the depthwise convolution network and performing feature extraction on the M input channels through the depthwise convolution network;
step 602, directly stacking the extracted M single-channel feature maps through the depthwise convolution network to obtain an M-channel feature map;
and step 603, performing spatial linear mapping on the M-channel feature map through the N convolution kernels of the pointwise convolution network to obtain the multi-channel feature map.
Specifically, the multi-channel feature map of this embodiment is an N-channel feature map obtained by mapping the M-channel feature map from an M-dimensional space to an N-dimensional space. The basic unit of the feature extraction network is a depthwise separable convolution, whose operation is a factorized convolution comprising two refinement operations in practice: a depthwise convolution and a pointwise convolution. Unlike a standard convolution, which applies every kernel across all input channels, the depthwise convolution of this embodiment uses a different kernel for each input channel (each kernel corresponds to one input channel) and extracts features channel by channel, avoiding the extraction of repeated features from the input image; the pointwise convolution then uses 1 × 1 kernels to combine the outputs of the depthwise convolution. The combination of depthwise and pointwise convolution provided by this embodiment achieves an overall feature extraction effect comparable to a standard convolution while greatly reducing network architecture complexity, model parameter count, and computation; it effectively lowers the power consumption of feature extraction and is easier to deploy on embedded devices such as mobile phones.
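A minimal PyTorch sketch of the depthwise separable unit described above: a depthwise convolution (one kernel per input channel, groups=M) followed by a 1 × 1 pointwise convolution mapping M channels to N. For a k × k kernel this costs roughly k·k·M + M·N multiplications per spatial position versus k·k·M·N for a standard convolution, which is the parameter and computation saving the text refers to.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, m_channels: int, n_channels: int, kernel_size: int = 3):
        super().__init__()
        # one kernel per input channel: M single-channel feature maps, stacked
        self.depthwise = nn.Conv2d(m_channels, m_channels, kernel_size,
                                   padding=kernel_size // 2, groups=m_channels)
        # N 1x1 kernels: spatial linear mapping from M-dimensional to N-dimensional space
        self.pointwise = nn.Conv2d(m_channels, n_channels, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# e.g. DepthwiseSeparableConv(32, 64)(torch.randn(1, 32, 56, 56)).shape == (1, 64, 56, 56)
```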
Step 103, when the person category determined based on the age group successfully matches the person category determined based on the target detection model, generating a person detection result based on the person category and the person position.
Specifically, the person categories of this embodiment are divided according to different age groups, so a person category can be determined directly from the output of the age detection model; the person categories determined by the age detection model and by the target detection model are then compared to judge whether they are the same. Because age detection is an apparent detection based on a local feature of the target (the facial features) while target detection is based on the target's global features, the errors easily caused by single-model detection can be effectively avoided.
In addition, when the person categories determined by the age detection model and the target detection model are the same, the detection results are considered valid, so the person category and person position can be output as the detection result; this effectively avoids the false detection and missed detection that occur with single-model person detection and improves the accuracy of the detection result. It should further be noted that this embodiment is preferably applied to person categories in infant scenes; in addition, the output detection result can be provided to other terminal applications for post-processing, such as focusing, coloring, and blurring, which benefits the optimization and improvement of terminal photographing.
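Step 103 amounts to a cross-check between the two models, as in the minimal sketch below; the layout of the detection tuples is an assumption for illustration.

```python
# Cross-check the category implied by the detected age group against the
# categories produced by the target detection model; only agreeing
# detections are emitted as person detection results.
def fuse_detection_results(age_group_category: str, detections: list) -> list:
    """detections: list of (person_category, person_position) pairs."""
    return [
        {"category": category, "position": position}
        for category, position in detections
        if category == age_group_category   # apparent and global results agree
    ]
```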
According to the technical solution provided by this embodiment of the present application, the age group in which a person object in an input image is located is detected based on an age detection model, and the person category and person position corresponding to the person object in the input image are detected based on a target detection model; when the person category determined from the age group successfully matches the person category determined by the target detection model, a person detection result is generated based on the person category and the person position. Through the implementation of this scheme, person detection is realized by integrating the apparent detection result of age detection with the global detection result of target detection, which effectively avoids the false detection and missed detection that occur with single-model person detection and improves the accuracy of the detection result.
Fig. 7 shows a detailed person detection method provided in a second embodiment of the present application, and the method includes:
Step 701, using the face image of the person object in the image to be detected and the global image of the image to be detected as the input images of the trained age detection model and target detection model, respectively.
Specifically, in consideration of the limitation of a single age detection model in performing person detection, this embodiment employs an integrated model including an age detection model and a target detection model to realize person detection. The image to be detected is the original image on which person detection needs to be performed.
Step 702, inputting the face image to the first network branch and the second network branch simultaneously, and extracting features of different age groups through a plurality of network segments.
Specifically, the first network branch and the second network branch of this embodiment are parallel heterogeneous networks that have the same number of network parameters but different activation functions in their feature extraction layers, and are used to extract heterogeneous features; each network branch is segmented across multiple layers with a coarse-to-fine strategy, each segment performing a partial age classification.
Step 703, fusing the feature outputs of the network segments of the first network branch and the second network branch to obtain the segment output corresponding to each network segment.
Specifically, the segmented output in this embodiment includes: an interval prediction distribution, an interval translation coefficient, and an interval width scaling coefficient.
Step 704, performing dynamic soft interval regression based on all the segment outputs to obtain the age group of the person object corresponding to the face image.
Specifically, since uniformly dividing the age range into non-overlapping intervals is not flexible enough to handle age-group imbalance and age continuity, this embodiment adopts soft classification, introduces dynamic ranges, allows each interval to be translated and scaled, and uses adaptive, input-dependent translation and scaling parameters.
Step 705, inputting the global image of the image to be detected into the depthwise convolution network, performing feature extraction on the M input channels through the depthwise convolution network, and directly stacking the extracted M single-channel feature maps to obtain an M-channel feature map.
Specifically, the depthwise convolution of this embodiment uses a different kernel for each input channel, i.e., each kernel corresponds to one input channel, and extracts features channel by channel, avoiding the extraction of repeated features from the input image.
Step 706, performing spatial linear mapping on the M-channel feature map through the N convolution kernels of the pointwise convolution network to obtain a multi-channel feature map.
Specifically, the pointwise convolution of this embodiment combines the outputs of the depthwise convolution using 1 × 1 kernels, and the obtained multi-channel feature map is an N-channel feature map in which the M-channel feature map is mapped from an M-dimensional space to an N-dimensional space.
Step 707, extracting features of the multi-channel feature map at different scales through the target detection network, and predicting the person category and person position corresponding to the person object in the global image by adopting prior boxes of different scales and aspect ratios based on the extracted multi-scale feature information.
In this embodiment, feature maps of different scales are extracted using different convolutional layers, and prediction uses prior boxes of different scales and aspect ratios, making the network simultaneously sensitive to both large and small objects in the input image while remaining fast and accurate.
Step 708, judging whether the person category determined based on the age group matches the person category determined based on the target detection model.
Step 709, when the matching is successful, generating a person detection result based on the person category and the person position.
When the person categories determined by the age detection model and the target detection model are the same, the detection results are considered valid, so the person category and person position can be output as the detection result, effectively avoiding the false detection and missed detection of single-model person detection and improving the accuracy of the detection result.
It should be understood that the step numbers in this embodiment do not imply an execution order; the execution order of each step should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
This embodiment of the present application discloses a person detection method that detects the age group in which a person object in an input image is located based on an age detection model, detects the person category and person position corresponding to the person object in the input image based on a target detection model, judges whether the person category determined from the age group matches the person category determined by the target detection model, and, when the matching is successful, generates a person detection result based on the person category and the person position. Through the implementation of this scheme, person detection is realized by integrating the apparent detection result of age detection with the global detection result of target detection, which effectively avoids the false detection and missed detection of single-model person detection and improves the accuracy of the detection result; moreover, the network architectures of the age detection model and the target detection model are simple and their model parameters few, effectively realizing lightweight models and improving the feasibility of person detection on embedded devices such as mobile phones.
Fig. 8 shows a person detection apparatus according to a third embodiment of the present application. The person detection apparatus can be used to implement the person detection method in the foregoing embodiments. As shown in fig. 8, the person detection apparatus mainly includes:
a determining module 801, configured to determine, based on an image to be detected, a first input image that needs to be input to a trained age detection model, and determine, based on the image to be detected, a second input image that needs to be input to a trained target detection model;
a detection module 802, configured to detect an age group in which the person object is located in the first input image based on the age detection model, and detect a person category and a person position corresponding to the person object in the second input image based on the target detection model; wherein the person categories are divided according to different age groups;
a generating module 803, configured to generate a person detection result based on the person category and the person position when matching between the person category determined based on the age group and the person category determined based on the target detection model is successful.
In an optional implementation manner of this embodiment, the determining module 801 is specifically configured to: determining a face image of a person object in an image to be detected as a first input image which needs to be input into a trained age detection model, and determining a global image of the image to be detected as a second input image which needs to be input into a trained target detection model.
As shown in fig. 9, in an optional implementation manner of this embodiment, the person detection apparatus further includes: an extracting module 804, configured to extract a face image of the person object in the image to be detected and determine the facial feature saliency of the face image. Correspondingly, the determining module 801 performs its function only when the facial feature saliency is lower than a preset threshold.
In an optional implementation manner of this embodiment, the age detection model includes a first network branch and a second network branch, which are parallel heterogeneous networks having the same number of network parameters but different activation functions in their feature extraction layers. When detecting the age group in which the person object is located in the first input image based on the age detection model, the detection module 802 is specifically configured to: simultaneously input the first input image to the first network branch and the second network branch, where each branch includes a plurality of network segments respectively used for performing feature extraction of different age groups; fuse the feature outputs of the network segments of the two branches to obtain the segment output corresponding to each network segment, where the segment output includes: an interval prediction distribution, an interval translation coefficient, and an interval width scaling coefficient; and perform dynamic soft interval regression based on all segment outputs to obtain the age group in which the person object is located in the first input image.
Referring to fig. 9 again, in an optional implementation manner of this embodiment, the person detection apparatus further includes: a configuration module 805, configured to acquire coarse-grained feature information of the person object in the first input image before the first input image is simultaneously input to the first network branch and the second network branch, and configure the plurality of network segments of the first network branch and the second network branch based on the coarse-grained feature information.
In an optional implementation manner of this embodiment, the target detection model includes a feature extraction network and a target detection network. When detecting the person category and the person position corresponding to the person object in the second input image based on the target detection model, the detection module 802 is specifically configured to: input the second input image into the feature extraction network for layer-by-layer feature extraction to obtain a multi-channel feature map; and extract features of the multi-channel feature map at different scales through the target detection network, predicting the person category and person position corresponding to the person object in the second input image by adopting prior boxes of different scales and aspect ratios based on the extracted multi-scale feature information.
Further, in an optional implementation manner of this embodiment, when inputting the second input image into the feature extraction network for layer-by-layer feature extraction to obtain the multi-channel feature map, the detection module 802 is specifically configured to: input the second input image into the depthwise convolution network, perform feature extraction on the M input channels through the depthwise convolution network, and directly stack the extracted M single-channel feature maps to obtain an M-channel feature map; and perform spatial linear mapping on the M-channel feature map through the N convolution kernels of the pointwise convolution network to obtain the multi-channel feature map, where the multi-channel feature map is an N-channel feature map in which the M-channel feature map is mapped from an M-dimensional space to an N-dimensional space.
It should be noted that, the person detection methods in the first and second embodiments can be implemented based on the person detection device provided in this embodiment, and persons skilled in the art can clearly understand that, for convenience and simplicity of description, the specific working process of the person detection device described in this embodiment may refer to the corresponding process in the foregoing method embodiment, and details are not described here.
According to the person detection apparatus provided in this embodiment, the age group in which the person object is located in the input image is detected based on the age detection model, and the person category and person position corresponding to the person object in the input image are detected based on the target detection model; when the person category determined from the age group successfully matches the person category determined by the target detection model, a person detection result is generated based on the person category and the person position. Through the implementation of this scheme, person detection is realized by integrating the apparent detection result of age detection with the global detection result of target detection, which effectively avoids the false detection and missed detection of single-model person detection and improves the accuracy of the detection result.
Referring to fig. 10, fig. 10 shows an electronic device according to a fourth embodiment of the present application. The electronic device can be used to implement the person detection method in the foregoing embodiments. As shown in fig. 10, the electronic device mainly includes:
a memory 1001, a processor 1002, a bus 1003 and a computer program stored on the memory 1001 and executable on the processor 1002, the memory 1001 and the processor 1002 being connected by the bus 1003. The processor 1002, when executing the computer program, implements the person detection method in the foregoing embodiments. Wherein the number of processors may be one or more.
The Memory 1001 may be a high-speed Random Access Memory (RAM) Memory or a non-volatile Memory (e.g., a disk Memory). The memory 1001 is used for storing executable program code, and the processor 1002 is coupled to the memory 1001.
Further, an embodiment of the present application also provides a computer-readable storage medium, where the computer-readable storage medium may be provided in an electronic device in the foregoing embodiments, and the computer-readable storage medium may be the memory in the foregoing embodiment shown in fig. 10.
The computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the person detection method in the foregoing embodiments. Further, the computer-readable storage medium may be various media that can store program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a RAM, a magnetic disk, or an optical disk.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of modules is merely a division of logical functions, and an actual implementation may have another division, for example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.
Modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode.
The integrated module, if implemented in the form of a software functional module and sold or used as a separate product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product; the software product is stored in a readable storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned readable storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.
It should be noted that, for the sake of simplicity, the above-mentioned method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present application is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The person detection method, apparatus, and computer-readable storage medium provided by the present application have been described in detail above. Those skilled in the art may make changes to the specific implementations and application scope according to the ideas of the embodiments of the present application; the scope of protection is defined by the following claims.

Claims (10)

1. A person detection method, comprising:
determining a first input image which needs to be input into a trained age detection model based on an image to be detected, and determining a second input image which needs to be input into a trained target detection model based on the image to be detected;
detecting an age group in which a person object is located in the first input image based on the age detection model, and detecting a person category and a person position corresponding to the person object in the second input image based on the target detection model; wherein the person categories are divided according to different age groups;
when the person category determined based on the age group successfully matches the person category determined based on the target detection model, generating a person detection result based on the person category and the person position.
2. The person detection method according to claim 1, wherein the determining of the first input image required to be input to the trained age detection model based on the image to be detected and the determining of the second input image required to be input to the trained target detection model based on the image to be detected comprise:
determining a face image of a person object in an image to be detected as a first input image which needs to be input into a trained age detection model, and determining a global image of the image to be detected as a second input image which needs to be input into a trained target detection model.
3. The person detection method according to claim 1, wherein before determining a first input image required to be input to the trained age detection model based on the image to be detected and determining a second input image required to be input to the trained target detection model based on the image to be detected, further comprising:
extracting a face image of the person object in the image to be detected, and determining the facial feature saliency of the face image;
and when the facial feature saliency is lower than a preset threshold, executing the steps of determining a first input image to be input into the trained age detection model based on the image to be detected and determining a second input image to be input into the trained target detection model based on the image to be detected.
4. The person detection method according to any one of claims 1 to 3, wherein the age detection model comprises a first network branch and a second network branch, the first network branch and the second network branch being parallel heterogeneous networks that have the same number of network parameters but different activation functions in their feature extraction layers;
the detecting an age group in which a person object is located in the first input image based on the age detection model comprises:
inputting the first input image to the first network branch and the second network branch simultaneously; wherein the first network branch and the second network branch each comprise a plurality of network segments respectively used for performing feature extraction of different age groups;
fusing the feature outputs of the network segments of the first network branch and the second network branch respectively to obtain the segment output corresponding to each network segment; wherein the segment output comprises: an interval prediction distribution, an interval translation coefficient, and an interval width scaling coefficient;
and performing dynamic soft interval regression based on all the segment outputs to obtain the age group in which the person object is located in the first input image.
5. The person detection method according to claim 4, wherein before inputting the first input image into the first network branch and the second network branch simultaneously, the method further comprises:
acquiring coarse-grained feature information of the person object in the first input image; and
configuring the plurality of network segments of the first network branch and the second network branch based on the coarse-grained feature information.
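How the coarse-grained cue reconfigures the segments is left open in claim 5; one hypothetical reading is that a rough prior narrows the age range each segment covers:

```python
def configure_segments(coarse_cue):
    # Hypothetical mapping, for illustration only -- the claim does not
    # fix how coarse-grained feature information selects the segments.
    if coarse_cue == "likely_minor":
        return [(0, 6), (7, 12), (13, 17)]           # finer bins for minors
    return [(0, 17), (18, 39), (40, 59), (60, 100)]  # coarser adult bins
```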
6. The person detection method according to any one of claims 1 to 3, wherein the target detection model comprises a feature extraction network and a target detection network;
wherein detecting, based on the target detection model, the person category and the person position corresponding to the person object in the second input image comprises:
inputting the second input image into the feature extraction network for layer-by-layer feature extraction to obtain a multi-channel feature map; and
extracting features from the multi-channel feature map at different scales through the target detection network, and predicting the person category and the person position corresponding to the person object in the second input image by using prior boxes of different scales and aspect ratios based on the extracted feature information at the different scales.
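Claim 6 describes an SSD-style detector: multi-scale features plus prior (anchor) boxes of varying scale and aspect ratio. A sketch of prior-box generation for one feature-map scale; the concrete scales and ratios here are illustrative assumptions:

```python
import itertools
import math

def prior_boxes(fmap_size, image_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    # One prior box per aspect ratio, centred on each feature-map cell.
    boxes = []
    step = image_size / fmap_size
    for i, j in itertools.product(range(fmap_size), repeat=2):
        cx, cy = (j + 0.5) * step, (i + 0.5) * step  # cell centre in pixels
        for ar in aspect_ratios:
            w = scale * image_size * math.sqrt(ar)
            h = scale * image_size / math.sqrt(ar)
            boxes.append((cx, cy, w, h))
    return boxes
```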
7. The person detection method according to claim 6, wherein the feature extraction network comprises a depthwise convolution network and a pointwise convolution network;
wherein inputting the second input image into the feature extraction network for layer-by-layer feature extraction to obtain the multi-channel feature map comprises:
inputting the second input image into the depthwise convolution network, performing feature extraction on M input channels through the depthwise convolution network, and directly stacking the M extracted single-channel feature maps to obtain an M-channel feature map; and
performing spatial linear mapping on the M-channel feature map through N convolution kernels of the pointwise convolution network to obtain the multi-channel feature map; wherein the multi-channel feature map is an N-channel feature map obtained by mapping the M-channel feature map from an M-dimensional space to an N-dimensional space.
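The depthwise-then-pointwise split of claim 7 is the familiar depthwise separable convolution: M per-channel filters produce the stacked M-channel map, and N 1×1 kernels linearly map it into N channels. A PyTorch sketch (normalization and activation layers, if any, are not specified by the claim and are omitted):

```python
import torch
import torch.nn as nn

class DepthwiseSeparableBlock(nn.Module):
    def __init__(self, m_channels: int, n_channels: int):
        super().__init__()
        # groups=m_channels -> one 3x3 filter per input channel (depthwise);
        # outputs are stacked directly into an M-channel feature map.
        self.depthwise = nn.Conv2d(m_channels, m_channels, kernel_size=3,
                                   padding=1, groups=m_channels)
        # N 1x1 kernels: a position-wise linear map from the M-dimensional
        # channel space to an N-dimensional one.
        self.pointwise = nn.Conv2d(m_channels, n_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))
```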
8. A person detection apparatus, comprising:
a determining module, configured to determine, based on an image to be detected, a first input image to be input into a trained age detection model, and determine, based on the image to be detected, a second input image to be input into a trained target detection model;
a detection module, configured to detect, based on the age detection model, an age group in which a person object is located in the first input image, and detect, based on the target detection model, a person category and a person position corresponding to the person object in the second input image; wherein person categories are divided in correspondence with different age groups; and
a generating module, configured to generate a person detection result based on the person category and the person position when the person category determined based on the age group is successfully matched with the person category determined based on the target detection model.
9. An electronic device, comprising a memory, a processor, and a bus, wherein the bus is configured to implement connection and communication between the memory and the processor; and the processor is configured to execute a computer program stored in the memory, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 7.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
CN201911059463.4A 2019-11-01 2019-11-01 Person detection method and device and computer readable storage medium Active CN110826469B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911059463.4A CN110826469B (en) 2019-11-01 2019-11-01 Person detection method and device and computer readable storage medium


Publications (2)

Publication Number Publication Date
CN110826469A true CN110826469A (en) 2020-02-21
CN110826469B CN110826469B (en) 2022-12-09

Family

ID=69552186

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911059463.4A Active CN110826469B (en) 2019-11-01 2019-11-01 Person detection method and device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN110826469B (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110222724A1 (en) * 2010-03-15 2011-09-15 Nec Laboratories America, Inc. Systems and methods for determining personal characteristics
CN103201763A (en) * 2010-10-22 2013-07-10 Nec软件有限公司 Attribute determination method, attribute determination device, program, recording medium, and attribute determination system
CN107545249A (en) * 2017-08-30 2018-01-05 国信优易数据有限公司 A kind of population ages' recognition methods and device
CN109094491A (en) * 2018-06-29 2018-12-28 深圳市元征科技股份有限公司 Method of adjustment, device, system and the terminal device of vehicle part
CN109034078A (en) * 2018-08-01 2018-12-18 腾讯科技(深圳)有限公司 Training method, age recognition methods and the relevant device of age identification model
CN109993150A (en) * 2019-04-15 2019-07-09 北京字节跳动网络技术有限公司 The method and apparatus at age for identification
CN110263819A (en) * 2019-05-28 2019-09-20 中国农业大学 A kind of object detection method and device for shellfish image

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TSUN-YI YANG: "SSR-Net: A Compact Soft Stagewise Regression Network for Age Estimation", Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112528897A (en) * 2020-12-17 2021-03-19 Oppo(重庆)智能科技有限公司 Portrait age estimation method, portrait age estimation device, computer equipment and storage medium
CN112528897B (en) * 2020-12-17 2023-06-13 Oppo(重庆)智能科技有限公司 Portrait age estimation method, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN110826469B (en) 2022-12-09

Similar Documents

Publication Publication Date Title
CN111091045B (en) Sign language identification method based on space-time attention mechanism
CN107545262B (en) Method and device for detecting text in natural scene image
CN109740534B (en) Image processing method, device and processing equipment
CN111291809B (en) Processing device, method and storage medium
CN111582141B (en) Face recognition model training method, face recognition method and device
CN113128558B (en) Target detection method based on shallow space feature fusion and adaptive channel screening
CN112464807A (en) Video motion recognition method and device, electronic equipment and storage medium
CN110765860A (en) Tumble determination method, tumble determination device, computer apparatus, and storage medium
CN111428664B (en) Computer vision real-time multi-person gesture estimation method based on deep learning technology
CN107944381B (en) Face tracking method, face tracking device, terminal and storage medium
CN111709295A (en) SSD-MobileNet-based real-time gesture detection and recognition method and system
CN111967464B (en) Weak supervision target positioning method based on deep learning
CN113807361B (en) Neural network, target detection method, neural network training method and related products
CN113487610B (en) Herpes image recognition method and device, computer equipment and storage medium
CN111783506A (en) Method and device for determining target characteristics and computer-readable storage medium
CN112487844A (en) Gesture recognition method, electronic device, computer-readable storage medium, and chip
CN112183649A (en) Algorithm for predicting pyramid feature map
CN111368634A (en) Human head detection method, system and storage medium based on neural network
CN108496174B (en) Method and system for face recognition
CN110826469B (en) Person detection method and device and computer readable storage medium
CN112580529A (en) Mobile robot perception identification method, device, terminal and storage medium
CN114863570A (en) Training and recognition method, device and medium of video motion recognition model
CN113496228A (en) Human body semantic segmentation method based on Res2Net, TransUNet and cooperative attention
CN117315791B (en) Bone action recognition method, device and storage medium
Mobsite et al. A Deep Learning Dual-Stream Framework for Fall Detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant