WO2023077897A1 - Human body detection method and apparatus, electronic device, and computer-readable storage medium

Info

Publication number: WO2023077897A1
Application number: PCT/CN2022/111687
Authority: WO
Inventors: 罗静; 郭宇鹏; 王晓; 毛少将; 雷庆庆
Original assignee: 通号通信信息集团有限公司
Priority date: 2021-11-05
Filing date: 2022-08-11
Publication date: 2023-05-11
Also published as: CN113762221B; CN113762221A

Abstract

The present disclosure relates to a human body detection method and apparatus, an electronic device, and a storage medium. The method comprises: extracting a structural feature of an image to be detected; determining a human body area in said image according to the structural feature; extracting a color feature of the human body area; and determining a human body detection result of said image according to the structural feature and the color feature, the human body detection result comprising a human body frame and key point information of said image.

Description

Human body detection method and device, electronic device, and computer-readable storage medium

Technical field

Embodiments of the present disclosure relate to the technical field of artificial intelligence, and in particular, to a human body detection method and device, electronic equipment, and a computer-readable storage medium.

Background

In recent years, with the rapid development of artificial intelligence and neural networks, human body posture recognition technology has been widely used in various application scenarios. Human body posture recognition mainly concerns the study and description of human body posture and the prediction of human behavior. The recognition process refers to recognizing human body movements according to changes in the positions of the joint points of the human body in a specified image or video.

Summary of the invention

Embodiments of the present disclosure provide a human body detection method and device, electronic equipment, and a computer-readable storage medium, which can identify a human body simply and accurately.

In a first aspect, an embodiment of the present disclosure provides a human body detection method, including: extracting structural features of an image to be detected, where the structural features are used to characterize structural information of the image to be detected;

determining a human body region in the image to be detected according to the structural features;

extracting color features of the human body region, where the color features are used to characterize color information of the human body region; and

determining a human body detection result of the image to be detected according to the structural features and the color features, where the human body detection result includes a human body frame and key point information of the image to be detected.

In a second aspect, an embodiment of the present disclosure provides a human body detection device, including: a first extraction module configured to extract structural features of an image to be detected, where the structural features are used to characterize structural information of the image to be detected;

an area determination module configured to determine a human body region in the image to be detected according to the structural features;

a second extraction module configured to extract color features of the human body region, where the color features are used to characterize color information of the human body region; and

a detection module configured to determine a human body detection result of the image to be detected according to the structural features and the color features, where the human body detection result includes a human body frame and key point information of the image to be detected.

In a third aspect, an embodiment of the present disclosure provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where any human body detection method in the embodiments of the present disclosure is implemented when the processor executes the computer program.

In a fourth aspect, an embodiment of the present disclosure provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, any human body detection method in the embodiments of the present disclosure is implemented.

In the embodiments of the present disclosure, the structural features of the image to be detected are extracted; the human body area in the image to be detected is determined according to the structural features; the color features of the human body area are extracted; and the human body detection result of the image to be detected is determined according to the structural features and the color features, where the human body detection result includes the human body frame and key point information of the image to be detected. Because this method uses structural features and color features together to detect human bodies, it can obtain high-accuracy human body detection results. In addition, the human body detection model corresponding to this method can be trained using images labeled with human body frames, without images labeled with key point coordinates, which avoids manual labeling of key point coordinates.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure. Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments with reference to the accompanying drawings.

Description of drawings

FIG. 1 is a flow chart of a human body detection method provided by an embodiment of the present disclosure.

FIG. 2 is a flow chart of a method for extracting structural features provided by an embodiment of the present disclosure.

FIG. 3 is a flowchart of a human body detection model training method provided by an embodiment of the present disclosure.

FIG. 4 is a schematic diagram of a training process of a human body detection model provided by an embodiment of the present disclosure.

FIG. 5 is a block diagram of a human body detection device provided by an embodiment of the present disclosure.

FIG. 6 is a block diagram of an electronic device used to implement the human body detection method of the embodiment of the present disclosure.

Detailed description

The present disclosure will be further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present disclosure, but not to limit the present disclosure. In addition, it should be noted that, for the convenience of description, only some structures related to the present disclosure are shown in the drawings but not all structures.

Action recognition has long been one of the main application scenarios of artificial intelligence, for example fall recognition, fight detection, and climbing detection. The core of this type of algorithm includes key point (or key bone point) detection and action classification, and the accuracy of action classification depends on the accuracy of key point detection. In related technologies, the mainstream key point detection methods include OpenPose, MoveNet, etc., all of which regress key point coordinates from features.

OpenPose is an open-source library for human body posture recognition developed by Carnegie Mellon University. It is based on convolutional neural networks and supervised learning and is built on the Caffe framework (Convolutional Architecture for Fast Feature Embedding). It can estimate poses such as human body movements, facial expressions, and finger movements, is suitable for both single-person and multi-person scenes, and has excellent robustness; it is the world's first deep-learning-based application for real-time multi-person 2D pose estimation. MoveNet is a model launched by Google for detecting human posture, with two variants, Lightning and Thunder. The former is suitable for latency-sensitive applications, while the latter focuses on improving recognition accuracy at the expense of speed.

Although the above key point detection methods achieve high recognition accuracy in most experimental scenarios, in actual application scenarios the recognition effect is often unsatisfactory because conditions are more complex. For example, in public places such as high-speed rail stations and subway stations, the accuracy of key point detection is low due to dense crowds and severe occlusion.

In related technologies, a human body detection algorithm usually trains an initial human body detection model on a large amount of training data labeled with key point coordinates. After the trained human body detection model is obtained, the picture to be detected is input into the model, and the model processes it and outputs a human body detection result that includes the key point coordinates of the picture to be detected. In this approach, training the human body detection model relies on a dataset labeled with key point coordinates. A picture usually requires a large number of key point coordinates to be labeled, and different human body parts are highly similar, which makes the key point coordinates difficult to label and consumes a lot of time and manpower.

In view of this, the embodiments of the present disclosure provide a human body detection method and device whose corresponding human body detection model can be trained using sample images labeled with human body frames. Compared with training on sample images labeled with key point coordinates, this effectively reduces the operation complexity and saves time and labor costs.

In a first aspect, an embodiment of the present disclosure provides a human body detection method.

The human body detection method in the embodiments of the present disclosure can be executed by a corresponding human body detection device, which can be implemented in software and/or hardware, and can generally be integrated into electronic equipment.

FIG. 1 is a flow chart of a human body detection method provided by an embodiment of the present disclosure. Referring to FIG. 1, the human body detection method of the embodiment of the present disclosure includes:

Step S101, extracting structural features of the image to be detected.

The structural features are used to characterize the structural information of the image to be detected. For example, when the image to be detected includes a human body, the structural features may be features of structures such as the head, elbows, joints, and wrists of the human body. In some specific implementations, structural features can be extracted from the image to be detected by convolution.

In some embodiments, the structural features of the image to be detected include first structural features and second structural features, and the step of extracting the structural features of the image to be detected includes:

Firstly, feature extraction is performed on the image to be detected based on the preset first convolution kernel to obtain the first structural feature; secondly, based on the preset second convolution kernel, feature extraction is performed on the first structural feature to obtain the second structural feature.

Here, a convolution kernel can be regarded as a filter matrix used to extract features from the image being convolved. In this embodiment, the second structural feature is obtained by further convolving the first structural feature, so it is a higher-level (or larger-scale) and more global feature than the first structural feature. The feature levels of the first structural feature and the second structural feature depend on the convolution kernel size and convolution stride used when extracting them.

For example, the first structural feature includes eye features, nose features, and mouth features of a person in the image to be detected, and correspondingly, the second structural feature may be a face feature. For another example, the first structural feature includes head features, elbow features, hand features, and leg features of a person in the image to be detected, and correspondingly, the second structural feature may be the overall structural feature of the person.

It should be noted that the above-mentioned first structural feature and second structural feature are only examples, and can be flexibly set according to actual needs, which is not limited in the present disclosure.
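As an illustration of the two-stage extraction described above, the following minimal sketch applies one kernel to an image to get a first structural feature and then a second kernel to that result to get a second structural feature. The helper `conv2d`, the toy image, and both kernels are assumptions chosen for illustration; the disclosure does not specify kernel values, sizes, or strides.

```python
from typing import List

Matrix = List[List[float]]


def conv2d(image: Matrix, kernel: Matrix, stride: int = 1) -> Matrix:
    """Valid (no padding) 2D cross-correlation, as used in CNN layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    return [
        [
            sum(
                image[r * stride + i][c * stride + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Hypothetical 6x6 single-channel image with a vertical edge in the middle.
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]

# Hypothetical kernels: a horizontal-gradient filter, then a 2x2 average.
first_kernel = [[-1.0, 1.0], [-1.0, 1.0]]
second_kernel = [[0.25, 0.25], [0.25, 0.25]]

# First structural feature: local response computed on the image itself.
first_feature = conv2d(image, first_kernel, stride=1)

# Second structural feature: computed on the first feature, not on the image.
second_feature = conv2d(first_feature, second_kernel, stride=2)

print(len(first_feature), "x", len(first_feature[0]))    # size of first feature
print(len(second_feature), "x", len(second_feature[0]))  # size of second feature
```

With a 2x2 kernel at stride 1 followed by a 2x2 kernel at stride 2, each value of the second feature depends on a 3x3 patch of the image, which is the sense in which it is more global than the first feature.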

It should also be noted that when performing feature extraction in related technologies, multiple features extracted at the same scale are usually spliced to obtain a complete feature map, which is then fused with feature maps of other scales. The difference is that in the embodiments of the present disclosure, for feature maps of the same scale, features are first superimposed according to convolution kernel clusters/feature types, and after enhanced features are obtained, operations such as feature splicing and feature fusion are performed.

Exemplarily, the first convolution kernels correspond to the same or similar feature extraction scale and are organized into multiple convolution kernel clusters. The convolution kernels in the same convolution kernel cluster are used to extract the same specific structural feature, and the third structural features extracted by the same convolution kernel cluster are superimposed to obtain the final first structural feature. In other words, in the embodiment of the present disclosure, features of the same scale are superimposed in units of feature type, so as to enhance the expression of the features. For example, the first convolution kernels are used to extract local structural features of the face, which include eye features, nose features, and mouth features: the first convolution kernel cluster extracts eye features, the second convolution kernel cluster extracts nose features, and the third convolution kernel cluster extracts mouth features. After these features are obtained, the multiple eye features extracted by the first convolution kernel cluster are superimposed to obtain the final eye feature. Similarly, the multiple nose features extracted by the second convolution kernel cluster are superimposed to obtain the final nose feature, and the multiple mouth features extracted by the third convolution kernel cluster are superimposed to obtain the final mouth feature. That is to say, when acquiring the first structural feature, the embodiments of the present disclosure do not fuse or combine features of different scales, but superimpose multiple features of the same scale and type to obtain enhanced features.

Step S102, according to the structural features, determine the human body area in the image to be detected.

The image to be detected includes a foreground area and a background area. In the human body detection scenario, the foreground area is the human body area, and the background area is the area made up of objects other than the human body. Human body detection is not concerned with the characteristics of the background area. Therefore, the background area needs to be removed from the image to be detected, or the foreground area extracted from it, so that the foreground area can be further analyzed and processed.

In some embodiments, the step of determining the human body area in the image to be detected according to the structural feature includes: regressing the structural feature to the image to be detected to determine the human body area in the image to be detected.

In some implementations, the structural features include a first structural feature and a second structural feature. The first structural feature and the second structural feature are regressed to the image to be detected to obtain a first human body region corresponding to the first structural feature and a second human body region corresponding to the second structural feature.

The above embodiment is described by taking, as an example, the case where the structural feature is the head feature of a human body. If the head feature is regressed directly to the image to be detected, the resulting area is usually a regular rectangle within which the head is located. In other words, the region determined by direct regression of the structural feature includes both the head region and part of the background region, so the head region cannot be accurately framed in the image to be detected. For this reason, before the structural features are regressed to the image to be detected, they are first filtered to remove background structural features. When the filtered structural features are regressed to the image to be detected, a regression result that includes only the human body area is obtained, which allows the human body structure to be framed accurately and provides the regional basis for the subsequent extraction of color features.

In some embodiments, according to the structural features, the step of determining the human body region in the image to be detected includes:

First, the structural features are filtered according to a preset structural feature threshold to obtain filtered structural features, where the structural feature threshold is used to filter out background structural features; secondly, the filtered structural features are regressed to the image to be detected to determine the human body region in the image to be detected. The structural feature threshold may be obtained from experience, from statistical data, or through training, which is not limited in the present disclosure.

In some implementations, the structural feature threshold includes a first structural feature threshold and a second structural feature threshold. According to the structural features, the step of determining the human body area in the image to be detected includes:

First, the first structural feature is filtered according to the first structural feature threshold to obtain the first filtered structural feature, and the first filtered structural feature is regressed to the image to be detected to obtain the first human body region. Secondly, the second structural feature is filtered according to the second structural feature threshold to obtain the second filtered structural feature, and the second filtered structural feature is regressed to the image to be detected to obtain the second human body region.

It should be noted that since different structural feature thresholds are set for different structural features, the filtered structural features obtained based on the structural feature thresholds are more accurate and reasonable, thereby obtaining more accurate human body regions.

It should also be noted that algorithms such as linear regression, K-nearest neighbor (K-NN) regression, decision tree regression, and random forest regression can be used when regressing structural features to the image to be detected, which is not limited in the present disclosure.
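The filter-then-locate idea can be sketched as follows. This is only an illustration under stated assumptions: thresholding is done by zeroing responses below the threshold, and a plain receptive-field coordinate mapping stands in for the regression algorithms listed above, which the disclosure leaves open. The function names, the toy feature map, and the threshold value are hypothetical.

```python
from typing import List, Set, Tuple

Matrix = List[List[float]]
Pixel = Tuple[int, int]  # (row, col) in the image to be detected


def filter_features(feature: Matrix, threshold: float) -> Matrix:
    """Zero out responses below the threshold (treated here as background)."""
    return [[v if v >= threshold else 0.0 for v in row] for row in feature]


def cells_to_pixels(feature: Matrix, kernel_size: int, stride: int) -> Set[Pixel]:
    """Image pixels covered by the receptive field of every nonzero cell.

    Assumes the feature map came from one valid convolution with a square
    kernel of the given size and stride.
    """
    pixels: Set[Pixel] = set()
    for r, row in enumerate(feature):
        for c, value in enumerate(row):
            if value != 0.0:
                for i in range(kernel_size):
                    for j in range(kernel_size):
                        pixels.add((r * stride + i, c * stride + j))
    return pixels


def bounding_box(pixels: Set[Pixel]) -> Tuple[int, int, int, int]:
    """Smallest (min_row, min_col, max_row, max_col) rectangle around pixels."""
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    return min(rows), min(cols), max(rows), max(cols)


# Hypothetical first structural feature and first structural feature threshold.
first_feature = [[0.1, 0.9, 0.2], [0.05, 0.8, 0.1]]
first_threshold = 0.5

first_filtered = filter_features(first_feature, first_threshold)
first_region = cells_to_pixels(first_filtered, kernel_size=2, stride=1)
print(sorted(first_region))
print(bounding_box(first_region))
```

A second structural feature would be handled the same way with its own threshold, kernel size, and stride, giving the second human body region.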

Step S103, extracting color features of the human body region.

The color feature is used to represent the color information of the human body area. For example, the color feature may be a feature based on the grayscale of the image, or a feature based on the RGB (red, green, blue) color channels.

It should be noted that the above color features are only examples, and the present disclosure does not limit them.

In some embodiments, the human body region includes a first human body region and a second human body region, where the first human body region is the region obtained by regressing the first filtered structural feature to the image to be detected, and the second human body region is the region obtained by regressing the second filtered structural feature to the image to be detected. The step of extracting the color features of the human body region includes:

Firstly, the color feature of the first human body area is extracted to obtain the first color feature; secondly, the color feature of the second human body area is extracted to obtain the second color feature.

In some other embodiments, the step of extracting the color features of the human body region includes: firstly, extracting the color feature of the first human body region to obtain the first color feature; secondly, performing convolution again on the first color feature to extract the second color feature. In other words, in this embodiment the second color feature is not obtained from the second structural feature but by further convolving the first color feature. Therefore, compared with the first color feature, the second color feature is a higher-level and more global feature.
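To make the notion of a region-restricted color feature concrete, the sketch below computes a mean RGB value and a grayscale value over the pixels of a human body region only. Using the mean as the color feature is an assumption made for illustration, as are the function names and the toy image; the grayscale weights are the common ITU-R BT.601 luma coefficients.

```python
from typing import Iterable, List, Tuple

RGB = Tuple[int, int, int]
Pixel = Tuple[int, int]  # (row, col)


def mean_rgb(image: List[List[RGB]], region: Iterable[Pixel]) -> Tuple[float, float, float]:
    """Average each color channel over the pixels that belong to the region."""
    coords = list(region)
    if not coords:
        raise ValueError("region must contain at least one pixel")
    totals = [0, 0, 0]
    for r, c in coords:
        for channel in range(3):
            totals[channel] += image[r][c][channel]
    n = len(coords)
    return totals[0] / n, totals[1] / n, totals[2] / n


def to_gray(rgb: Tuple[float, float, float]) -> float:
    """Weighted sum of R, G, B using BT.601 luma coefficients."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


# Hypothetical 2x2 RGB image; the left column is the human body region.
image = [
    [(255, 0, 0), (0, 0, 0)],
    [(255, 0, 0), (0, 255, 0)],
]
body_region = {(0, 0), (1, 0)}

first_color_feature = mean_rgb(image, body_region)
gray_feature = to_gray(first_color_feature)
print(first_color_feature, gray_feature)
```

Because only pixels inside the region contribute, the green background pixel at (1, 1) has no effect on the result.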

Step S104, according to the structure feature and color feature, determine the human body detection result of the image to be detected.

The human body detection result includes the human body frame and key point information of the image to be detected. The human body frame is a rectangle or square that represents the area occupied by the human body in the image. The key point information includes the coordinates of the key points of the human body. In some specific implementations, the key points correspond to 17 parts of the human body, namely the nose, left and right eyes, left and right ears, left and right shoulders, left and right elbows, left and right wrists, left and right hips, left and right knees, and left and right ankles.
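One possible in-memory representation of such a detection result is sketched below. The class and field names are hypothetical and the coordinate convention (pixel coordinates, corner-based rectangle) is an assumption; only the list of 17 body parts comes from the text above.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# The 17 parts listed in the text, with left/right pairs expanded.
KEYPOINT_NAMES: Tuple[str, ...] = (
    "nose",
    "left_eye", "right_eye",
    "left_ear", "right_ear",
    "left_shoulder", "right_shoulder",
    "left_elbow", "right_elbow",
    "left_wrist", "right_wrist",
    "left_hip", "right_hip",
    "left_knee", "right_knee",
    "left_ankle", "right_ankle",
)


@dataclass(frozen=True)
class BodyFrame:
    """Axis-aligned rectangle around one person, in pixel coordinates."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def __post_init__(self) -> None:
        if self.x_max <= self.x_min or self.y_max <= self.y_min:
            raise ValueError("frame must have positive width and height")


@dataclass
class DetectionResult:
    """Human body frame plus key point coordinates for one person."""
    frame: BodyFrame
    keypoints: Dict[str, Tuple[float, float]]  # name -> (x, y)

    def missing_keypoints(self) -> List[str]:
        """Names from KEYPOINT_NAMES that have no coordinates yet."""
        return [name for name in KEYPOINT_NAMES if name not in self.keypoints]


# Hypothetical partial result with two key points filled in.
result = DetectionResult(
    frame=BodyFrame(100, 50, 160, 250),
    keypoints={"nose": (130.0, 60.0), "left_hip": (120.0, 150.0)},
)
print(len(KEYPOINT_NAMES), len(result.missing_keypoints()))
```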

In some embodiments, according to the structural features and color features, the step of determining the human body detection result of the image to be detected includes:

First, connect the first structural feature and the first color feature to obtain the first connection feature; secondly, connect the second structural feature and the second color feature to obtain the second connection feature; finally, based on a preset activation function, activate the first connection feature and the second connection feature to obtain the human body detection result of the image to be detected.

The structural feature and the color feature can be connected through a concatenation (Concat) function. Activation functions include, but are not limited to, Sigmoid, Tanh, and ReLU.
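The two operations named here, concatenation and activation, can be sketched on flat feature vectors as follows. The vector values and helper names are hypothetical, and a real model would typically have learned layers around these operations that map the activated features to a body frame and key points; the sketch shows only the concatenation and the element-wise activation.

```python
import math
from typing import Callable, List, Sequence


def concat(structural: Sequence[float], color: Sequence[float]) -> List[float]:
    """Join a structural feature vector and a color feature vector end to end."""
    return list(structural) + list(color)


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def relu(x: float) -> float:
    """Rectified linear unit: negative inputs become 0."""
    return x if x > 0 else 0.0


def activate(features: Sequence[float], fn: Callable[[float], float] = sigmoid) -> List[float]:
    """Apply an activation function element-wise (math.tanh also works as fn)."""
    return [fn(v) for v in features]


# Hypothetical flattened features.
first_connection = concat([0.0, 2.0], [-1.0])      # first structural + first color
second_connection = concat([1.0], [0.5, -0.5])     # second structural + second color

first_out = activate(first_connection)             # Sigmoid
second_out = activate(second_connection, relu)     # ReLU
print(first_out, second_out)
```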

It should be noted that, in some embodiments, after determining the human body detection result of the image to be detected, the method further includes:

First, determine human body posture information according to the human body frame and key point information of the image to be detected; secondly, send an early warning signal when it is determined from the human body posture information that a preset early warning event has occurred.

For example, in stations, carriages, or other public places, human body detection is performed on surveillance video to obtain human body detection results. When the human body posture is determined to be a falling posture according to the human body frame and key point information, it is known that a person has fallen, and an early warning signal can be sent to a staff terminal or broadcast terminal so that the relevant staff can carry out emergency treatment or start the emergency plan in time.
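The disclosure does not say how a falling posture is derived from the human body frame and key points. Purely as a hypothetical heuristic, the sketch below flags a fall when the frame is clearly wider than it is tall and the nose is at roughly the same height as the hips, and then calls a caller-supplied function to send the warning. The thresholds, function names, and sample coordinates are all assumptions.

```python
from typing import Callable, Dict, Tuple

Box = Tuple[float, float, float, float]     # (x_min, y_min, x_max, y_max), y grows downward
Keypoints = Dict[str, Tuple[float, float]]  # name -> (x, y)


def is_fall_posture(
    box: Box,
    keypoints: Keypoints,
    ratio: float = 1.2,
    level_fraction: float = 0.2,
) -> bool:
    """Hypothetical rule: wide frame and head roughly level with the hips."""
    x_min, y_min, x_max, y_max = box
    width, height = x_max - x_min, y_max - y_min
    wide_frame = width > ratio * height
    hip_y = (keypoints["left_hip"][1] + keypoints["right_hip"][1]) / 2
    head_near_hip_level = abs(keypoints["nose"][1] - hip_y) < level_fraction * max(width, height)
    return wide_frame and head_near_hip_level


def check_and_warn(box: Box, keypoints: Keypoints, send: Callable[[str], None]) -> bool:
    """Send an early warning through `send` if the posture looks like a fall."""
    if is_fall_posture(box, keypoints):
        send("fall detected")
        return True
    return False


# Hypothetical detections: one standing person, one lying person.
standing = ((100, 50, 160, 250), {"nose": (130, 60), "left_hip": (120, 150), "right_hip": (140, 150)})
lying = ((50, 200, 250, 260), {"nose": (60, 230), "left_hip": (150, 235), "right_hip": (150, 240)})

sent = []  # stands in for a staff terminal or broadcast terminal
standing_warned = check_and_warn(*standing, send=sent.append)
lying_warned = check_and_warn(*lying, send=sent.append)
print(standing_warned, lying_warned, sent)
```

Passing the sender as a parameter keeps the posture rule separate from the delivery channel, so the same check could notify a staff terminal or a broadcast terminal.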

In this embodiment, the structural features of the image to be detected are extracted; the human body area in the image to be detected is determined according to the structural features; the color features of the human body area are extracted; and the human body detection result of the image to be detected is determined according to the structural features and the color features, where the human body detection result includes the human body frame and key point information of the image to be detected. Because this method uses structural features and color features together to detect human bodies, it can obtain high-accuracy human body detection results. In addition, the human body detection model corresponding to this method can be trained using images labeled with human body frames, without images labeled with key point coordinates, which avoids manual labeling of key point coordinates.

FIG. 2 is a flow chart of a method for extracting structural features provided by an embodiment of the present disclosure. As shown in FIG. 2, the structural feature extraction method includes the following steps:

In step S201, feature extraction is performed on the image to be detected through multiple convolution kernel clusters to obtain a third structural feature.

The first convolution kernel includes multiple convolution kernel clusters, and each convolution kernel cluster includes at least one convolution kernel. A third structural feature is the structural feature obtained when a single convolution kernel performs feature extraction on the image to be detected, so the number of third structural features equals the number of convolution kernels (in the single-channel case).

In some embodiments, in order to fully and accurately extract structural features from the image to be detected, multiple first convolution kernels are set. These convolution kernels are divided into multiple convolution kernel clusters through a clustering operation, and each convolution kernel cluster includes at least one convolution kernel. For each convolution kernel cluster, the convolution kernels belonging to the convolution kernel cluster have high similarity, which is specifically manifested in that these convolution kernels have better extraction effects when extracting a certain type of structural features.
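The disclosure does not name the clustering operation. As one hedged possibility, the sketch below groups flattened kernels greedily by cosine similarity to the first member of each cluster. The functions `cosine` and `cluster_kernels`, the 0.9 similarity cutoff, and the toy kernels are all hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two flattened kernels (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_kernels(kernels: List[Sequence[float]], min_similarity: float = 0.9) -> List[List[int]]:
    """Greedy grouping: join the first cluster whose seed kernel is similar enough."""
    clusters: List[List[int]] = []
    for index, kernel in enumerate(kernels):
        for cluster in clusters:
            seed = kernels[cluster[0]]
            if cosine(kernel, seed) >= min_similarity:
                cluster.append(index)
                break
        else:
            # No existing cluster matched, so this kernel seeds a new one.
            clusters.append([index])
    return clusters


# Hypothetical 2x2 kernels, flattened row by row.
kernels = [
    [1.0, 0.0, -1.0, 0.0],
    [2.0, 0.0, -2.0, 0.1],
    [0.0, 1.0, 0.0, -1.0],
    [0.0, 1.1, 0.0, -1.0],
]
clusters = cluster_kernels(kernels)
print(clusters)
```

Each inner list holds the indices of kernels assigned to one convolution kernel cluster, and every cluster contains at least one kernel by construction.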

Step S202, superimposing the third structural features corresponding to the same convolution kernel cluster to obtain the structural features corresponding to the convolution kernel cluster.

Wherein, the first structural feature includes structural features corresponding to multiple convolution kernel clusters.

In some embodiments, the third structural features extracted by the convolution kernels of the same convolution kernel cluster are superimposed to obtain the structural features corresponding to the convolution kernel cluster.

Since the third structural features corresponding to the same convolution kernel cluster respond well to a particular type of structural feature, superimposing them enhances that feature and yields a better structural feature.

Step S203, performing feature extraction on the first structural feature based on the preset second convolution kernel to obtain the second structural feature.

For example, the first convolution kernel includes 100 convolution kernels, and these convolution kernels are divided into a first convolution kernel cluster, a second convolution kernel cluster, and a third convolution kernel cluster. The convolution kernels belonging to the first convolution kernel cluster are better at extracting elbow features, the convolution kernels belonging to the second convolution kernel cluster are better at extracting wrist features, and the convolution kernels belonging to the third convolution kernel cluster are better at extracting head features. When these 100 convolution kernels are used to perform feature extraction on the image to be detected, 100 third structural features are obtained.

For the first convolution kernel cluster, the third structural features extracted by its convolution kernels are superimposed to obtain the first structural feature corresponding to the first convolution kernel cluster; for the second convolution kernel cluster, the third structural features extracted by its convolution kernels are superimposed to obtain the first structural feature corresponding to the second convolution kernel cluster; for the third convolution kernel cluster, the third structural features extracted by its convolution kernels are superimposed to obtain the first structural feature corresponding to the third convolution kernel cluster.

In this embodiment, since the first convolution kernel cluster is better at extracting elbow features, the first structural feature corresponding to the first convolution kernel cluster reflects the elbow features better than any single third structural feature does. Similarly, the first structural feature corresponding to the second convolution kernel cluster better reflects the wrist features, and the first structural feature corresponding to the third convolution kernel cluster better reflects the head features.
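The per-cluster superposition in this example can be sketched as below, assuming that "superimposing" means element-wise addition of same-sized feature maps; the disclosure does not define the operation more precisely. The function name, cluster labels, and toy feature maps are hypothetical, and four maps stand in for the 100 in the example.

```python
from collections import defaultdict
from typing import Dict, List

Matrix = List[List[float]]


def superimpose_by_cluster(feature_maps: List[Matrix], cluster_ids: List[str]) -> Dict[str, Matrix]:
    """Element-wise sum of the feature maps whose kernels share a cluster."""
    grouped: Dict[str, List[Matrix]] = defaultdict(list)
    for feature_map, cluster_id in zip(feature_maps, cluster_ids):
        grouped[cluster_id].append(feature_map)
    return {
        cluster_id: [
            [sum(values) for values in zip(*rows)]  # same column across maps
            for rows in zip(*maps)                  # same row across maps
        ]
        for cluster_id, maps in grouped.items()
    }


# Hypothetical third structural features, one per convolution kernel.
third_features = [
    [[1, 0], [0, 2]],  # kernel 0, elbow cluster
    [[3, 1], [0, 1]],  # kernel 1, elbow cluster
    [[0, 5], [0, 0]],  # kernel 2, wrist cluster
    [[7, 0], [0, 0]],  # kernel 3, head cluster
]
kernel_clusters = ["elbow", "elbow", "wrist", "head"]

first_features = superimpose_by_cluster(third_features, kernel_clusters)
print(first_features)
```

The result has one map per cluster rather than one per kernel, so later splicing and fusion steps operate on the enhanced per-type features.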

It should be noted that, in some specific implementations, the human body detection method provided by the embodiments of the present disclosure may be implemented by a preset human body detection model. The human body detection model includes a model constructed based on a neural network.

In some embodiments, before extracting the structural features of the image to be detected, the method further includes:

The human body detection model is trained through a preset training set, wherein the training set includes sample images and human frame annotation information of the sample images.

In related technologies, the training set used for training the human detection model includes sample images and key point coordinate labeling information of the sample images. Through model training, the human body detection model can learn the key point coordinate labeling ability, so as to mark the key point coordinates for the image to be detected. However, the key point coordinate labeling of sample images usually relies on manual labeling, which is complex and takes a lot of time and manpower.

In the embodiments of the present disclosure, model training is performed using a training set that includes sample images and human body frame labeling information of the sample images, and the model learns to label human body frames during training. Human body frame labeling relies on recognizing and extracting features in the image: the more accurate the recognized and extracted features, the more accurate the resulting human body frame and, correspondingly, the more accurate the key point coordinates determined from those features. In other words, the human body detection model provided by the embodiments of the present disclosure can be trained with only the sample images and their human body frame annotation information as the training set and still acquire the ability to label key point coordinates. No key point coordinates need to be labeled on the sample images, which simplifies the operation and saves a lot of time and manpower.

Step S301, input the training set into the initial human body detection model, and extract the detailed features of the sample image through the first convolutional network.

Wherein, the training set includes the sample images and the human body frame annotation information of the sample images. In other words, the sample images used for training the human body detection model are images annotated with human body frames. The first convolutional network includes a plurality of convolutional layers (for example, 3 convolutional layers), which are used to extract low-level structural features (i.e., detail features); the detail features include texture features and the like.

Step S302, extracting the first structural feature of the sample image through the second convolutional network, and filtering the first structural feature according to the first structural feature threshold to obtain the first filtered structural feature.

Step S303, regressing the first filtered structural feature to the sample image, determining the first human body region, and extracting the color features of the first human body region to obtain the first color feature.

Step S304, using the third convolutional network to perform global feature extraction on the first structural feature to obtain the second structural feature, and filtering the second structural feature according to the second structural feature threshold to obtain the second filtered structural feature.

Step S305, regressing the second filtered structural feature to the sample image, determining the second human body region, and extracting the color features of the second human body region to obtain the second color feature.

Step S306, connecting the first structural feature and the first color feature through the connection layer to obtain the first connection feature, and connecting the second structural feature and the second color feature through the connection layer to obtain the second connection feature.

Step S307, input the first connection feature and the second connection feature into the activation layer, and obtain the human body detection result of the sample image through activation processing.

Wherein, the human body detection result includes the body frame and key point information of the sample image.

Step S308, adjust the parameters of the human body detection model according to the human body detection result, and use the adjusted human body detection model to perform iterative training until the preset stop condition is met, then stop the model training.

Wherein, the stop condition may be a condition related to detection accuracy and/or training times, which is not limited in the present disclosure. The human body detection model obtained after the training is stopped is regarded as a model that meets the requirements, and human body detection can be performed based on the human body detection model.
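The iteration in step S308 and its stop condition can be sketched as follows. This is only a minimal illustration, assuming a hypothetical callable train_step that performs one round of parameter adjustment and returns the current detection accuracy; the names train_until_stop, max_iterations and target_accuracy are not from the disclosure.

```python
from typing import Callable, Optional, Tuple


def train_until_stop(
    train_step: Callable[[], float],
    max_iterations: Optional[int] = None,
    target_accuracy: Optional[float] = None,
) -> Tuple[int, float]:
    """Repeat train_step until an accuracy and/or iteration-count condition is met.

    train_step is a hypothetical stand-in for one pass of steps S301 to S308;
    it returns the detection accuracy measured after the parameter update.
    """
    if max_iterations is None and target_accuracy is None:
        raise ValueError("at least one stop condition is required")
    iteration = 0
    while True:
        accuracy = train_step()
        iteration += 1
        # Stop condition related to detection accuracy.
        if target_accuracy is not None and accuracy >= target_accuracy:
            break
        # Stop condition related to the number of training iterations.
        if max_iterations is not None and iteration >= max_iterations:
            break
    return iteration, accuracy
```

Either condition can be used alone or both together, which mirrors the "and/or" wording of the stop condition above.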

It should be noted that, in some embodiments, after multiple rounds of training, the detection accuracy of the trained model can be further improved by clustering convolution kernels. Specifically:

First, the convolution kernels of the second convolutional network are clustered to obtain convolution kernel clusters. Optionally, in some embodiments, only the convolution kernels corresponding to the first convolutional layer of the second convolutional network are clustered, so that similar convolution kernels are grouped into the same convolution kernel cluster. Second, the first structural features corresponding to the same convolution kernel cluster are superimposed to obtain the enhanced first structural feature corresponding to that cluster, and the first color feature is extracted using the enhanced first structural feature. Third, the enhanced first structural feature is used to determine the second structural feature, and the second color feature is extracted according to the second structural feature. Then, through the connection layer, the first structural feature and the first color feature are connected to obtain the first connection feature, and the second structural feature and the second color feature are connected to obtain the second connection feature. Finally, the first connection feature and the second connection feature are input into the activation layer, the human body detection result of the sample image is obtained through activation processing, the parameters of the human body detection model are adjusted according to the human body detection result, and the adjusted model is trained iteratively again; when the preset stop condition is met, model training stops and a trained human body detection model is obtained. Among the convolution kernel clusters obtained by clustering, some are less effective at extracting structural features (for example, clusters that mainly extract noise features); these clusters can be filtered out to improve the accuracy of structural feature extraction.

Through convolution kernel clustering and convolution kernel filtering operations, model parameters can be reduced, and the generalization ability of the model can be improved, so that the human body detection model can obtain good human body detection results for different application scenarios and different types of pictures.
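The clustering and superimposing operations can be illustrated with a small sketch. The disclosure does not name a clustering algorithm, so the plain k-means used here is an assumption, as are the function names cluster_kernels and superimpose_by_cluster and the array shapes.

```python
import numpy as np


def cluster_kernels(kernels: np.ndarray, num_clusters: int, num_iters: int = 10) -> np.ndarray:
    """Assign each kernel to a cluster using plain k-means (an assumed choice).

    kernels has shape (K, h, w); the result has shape (K,) and holds the
    cluster index of each kernel.
    """
    flat = kernels.reshape(len(kernels), -1).astype(float)
    centers = flat[:num_clusters].copy()  # deterministic initialisation
    labels = np.zeros(len(flat), dtype=int)
    for _ in range(num_iters):
        # Distance of every kernel to every cluster centre.
        dists = np.linalg.norm(flat[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(num_clusters):
            if np.any(labels == c):
                centers[c] = flat[labels == c].mean(axis=0)
    return labels


def superimpose_by_cluster(feature_maps: np.ndarray, labels: np.ndarray, num_clusters: int) -> np.ndarray:
    """Sum the feature maps produced by kernels that belong to the same cluster.

    feature_maps has shape (K, H, W), one map per kernel; the result has
    shape (num_clusters, H, W), one enhanced map per cluster.
    """
    return np.stack([feature_maps[labels == c].sum(axis=0) for c in range(num_clusters)])
```

Because one enhanced feature map is kept per cluster rather than one per kernel, the number of feature channels passed to later layers shrinks, which is one way to read the parameter reduction described above.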

Fig. 4 is a schematic diagram of a training process of a human body detection model provided by an embodiment of the present disclosure. As shown in Fig. 4, the training set includes multiple sample images, and the sample images are annotated with human body frames. After the training set is input into the human body detection model, the detail features of the sample image are first extracted through a convolution operation, the first structural features are further extracted on the basis of the detail features, and the first color feature is extracted from the sample image according to the first structural feature and the first structural feature threshold. On the basis of the first structural feature, the second structural feature is further extracted, and the global color feature is extracted from the sample image according to the second structural feature and the second structural feature threshold.

Through the connection layer, the first structural feature and the first color feature are connected to form the first connection feature, and the second structural feature and the second color feature are connected to form the second connection feature. The first connection feature and the second connection feature are then input to the activation layer and processed by the preset activation function to obtain the training result.

It should be noted that after several rounds of training, the convolution kernels for extracting the first structural features can be clustered to obtain convolution kernel clusters. The convolution kernels belonging to the same convolution kernel cluster are highly similar, which is specifically manifested in that these kernels are all effective at extracting a certain type of structural feature. Therefore, the first structural features extracted by the convolution kernels in the same cluster can be superimposed to obtain enhanced first structural features, so that second structural features with higher accuracy can be extracted based on the enhanced first structural features.

Among these convolution kernel clusters, there may be some that are not effective at extracting structural features (for example, clusters that mainly extract noise features). To improve the extraction of structural features, these clusters can be filtered out; filtering convolution kernels in this way is similar to a "model pruning" operation.
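Filtering out ineffective clusters can be sketched as below. The disclosure does not state how the effectiveness of a cluster is measured, so cluster_scores and min_score are hypothetical inputs (for example, a per-cluster contribution measured on validation data), and prune_clusters is an illustrative name.

```python
import numpy as np


def prune_clusters(enhanced_features: np.ndarray, cluster_scores, min_score: float):
    """Keep only the clusters whose score reaches min_score.

    enhanced_features has shape (C, H, W), one enhanced map per cluster.
    cluster_scores is a hypothetical effectiveness score per cluster.
    Returns the kept feature maps and the indices of the kept clusters.
    """
    keep = np.asarray(cluster_scores, dtype=float) >= min_score
    return enhanced_features[keep], np.flatnonzero(keep)
```

As with model pruning, the discarded clusters no longer contribute feature maps to the later layers.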

The division of the above methods into steps is only for clarity of description. During implementation, steps can be combined into one step, or a step can be split into multiple steps; as long as the same logical relationship is included, they are all within the protection scope of the present disclosure. Adding insignificant modifications to, or introducing insignificant designs into, the algorithm or process without changing the core design of the algorithm and process is also within the protection scope of this application.

The second aspect of the present disclosure provides a human body detection device. Fig. 5 is a block diagram of a human body detection device provided by an embodiment of the present disclosure. As shown in Figure 5, the human body detection device 500 includes:

The first extraction module 501 is configured to extract structural features of the image to be detected, and the structural features are used to characterize structural information of the image to be detected.

For example, when the image to be detected includes a human body, its structural features may be structural features such as the head, elbow, joint, and foot of the human body. In some specific implementations, structural features can be extracted from the image to be detected by convolution.

In some embodiments, the structural features of the image to be detected include first structural features and second structural features, and the first extraction module 501 includes a first extraction unit and a second extraction unit. The first extraction unit is used to perform feature extraction on the image to be detected based on the preset first convolution kernel to obtain the first structural features; the second extraction unit is used to perform feature extraction on the first structural features based on the preset second convolution kernel to obtain the second structural features.

The convolution kernel can be regarded as a filter matrix, which is used to extract features from the convolved image. In this embodiment, the second structural feature is obtained by further convolving the first structural feature, so it is a higher-level and more global feature than the first structural feature.

For example, the first structural feature includes eye features, nose features, and mouth features, and the second structural feature is a human face feature. For another example, the first structural feature includes the head feature, elbow feature, hand feature, leg feature and foot feature of a certain person in the image to be detected, and the second structural feature is the overall structural feature of the person.
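The two-stage extraction performed by the first and second extraction units can be sketched with a plain sliding-window convolution. The kernel values, the 6x6 test image and the helper name conv2d_valid are illustrative assumptions; in the disclosure the kernels are learned during training.

```python
import numpy as np


def conv2d_valid(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide the kernel over the image without padding.

    This is the cross-correlation form commonly used by deep learning frameworks.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w), dtype=float)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


image = np.arange(36, dtype=float).reshape(6, 6)  # stand-in for the image to be detected
first_kernel = np.array([[1.0, 0.0, -1.0]] * 3)   # hypothetical first convolution kernel
second_kernel = np.ones((2, 2)) / 4.0             # hypothetical second convolution kernel

# First structural feature: extracted directly from the image.
first_structural = conv2d_valid(image, first_kernel)
# Second structural feature: extracted from the first structural feature.
second_structural = conv2d_valid(first_structural, second_kernel)
```

Each value of second_structural depends on a 4x4 patch of the original image, whereas each value of first_structural depends on a 3x3 patch, which illustrates why the second feature is described as more global.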

The area determining module 502 is configured to determine the human body area in the image to be detected according to the structural features.

In some embodiments, the area determination module 502 includes a regression unit. Wherein, the regression unit is used for regressing structural features into the image to be detected, and determining the human body area in the image to be detected.

In some implementations, the structural features include a first structural feature and a second structural feature. Based on the regression unit, the first structural feature and the second structural feature are respectively regressed into the image to be detected to obtain a first human body region corresponding to the first structural feature and a second human body region corresponding to the second structural feature.

In some embodiments, the area determining module 502 further includes a filtering unit. The filtering unit is used to filter the structural features according to the preset structural feature threshold to obtain the filtered structural features, wherein the structural feature threshold is used to filter out the background structural features among the structural features; the regression unit is also used to regress the filtered structural features to the image to be detected and determine the human body area in the image to be detected. The structural feature threshold may be obtained according to experience, statistical data or through training, which is not limited in the present disclosure.

In some implementations, the structural feature threshold includes a first structural feature threshold and a second structural feature threshold. The filtering unit is specifically used to filter the first structural feature according to the first structural feature threshold to obtain the first filtered structural feature, and the regression unit is specifically used to regress the first filtered structural feature to the image to be detected to obtain the first human body region. Likewise, the filtering unit is specifically used to filter the second structural feature according to the second structural feature threshold to obtain the second filtered structural feature, and the regression unit is specifically used to regress the second filtered structural feature to the image to be detected to obtain the second human body region.

It should also be noted that algorithms such as linear regression, K-nearest neighbor regression, decision tree regression, and random forest regression may be used when regressing structural features to the image to be detected, which is not limited in the present disclosure.
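The filtering and regression units can be illustrated as follows. This sketch does not use any of the regression algorithms listed above; it swaps in a much simpler mapping that takes the bounding box of the feature-map cells that survive the threshold. The names filter_features, region_from_features and the stride parameter are assumptions.

```python
import numpy as np


def filter_features(feature_map: np.ndarray, threshold: float) -> np.ndarray:
    """Zero out responses below the threshold (treated here as background)."""
    return np.where(feature_map >= threshold, feature_map, 0.0)


def region_from_features(filtered: np.ndarray, stride: int = 1):
    """Map the surviving feature-map cells back to an image region.

    stride is a hypothetical scale factor between feature-map cells and
    image pixels. Returns (top, left, bottom, right), or None when no
    response survives the filtering.
    """
    rows, cols = np.nonzero(filtered)
    if rows.size == 0:
        return None
    return (
        int(rows.min()) * stride,
        int(cols.min()) * stride,
        (int(rows.max()) + 1) * stride,
        (int(cols.max()) + 1) * stride,
    )
```

Applying the same two functions with the first and second thresholds to the first and second structural features would give the first and second human body regions respectively.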

The second extraction module 503 is configured to extract color features of the human body region, and the color features are used to represent color information of the human body region.

Among them, the color feature is used to represent the color information of the human body area. For example, the color feature is a feature based on image grayscale, and for another example, the color feature is a feature based on RGB color channels.
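A color feature based on the RGB channels and on image grayscale can be sketched as below. The choice of per-channel means and of a simple channel average as the gray value is an assumption made for illustration; the disclosure does not fix a particular color descriptor.

```python
import numpy as np


def color_features(image_rgb: np.ndarray, region) -> np.ndarray:
    """Return [mean R, mean G, mean B, mean gray] for a region of an RGB image.

    image_rgb has shape (H, W, 3); region is (top, left, bottom, right).
    The gray value is a plain average of the three channels, not a
    perceptual luma.
    """
    top, left, bottom, right = region
    patch = image_rgb[top:bottom, left:right].astype(float)
    rgb_mean = patch.reshape(-1, 3).mean(axis=0)
    gray_mean = patch.mean(axis=2).mean()
    return np.concatenate([rgb_mean, [gray_mean]])
```

Because the feature is computed only inside the human body region, background colors outside the region do not influence it.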

In some embodiments, the human body region includes a first human body region and a second human body region, wherein the first human body region is the region obtained by regressing the first filtered structural feature to the image to be detected, and the second human body region is the region obtained by regressing the second filtered structural feature to the image to be detected. The second extraction module 503 includes a third extraction unit and a fourth extraction unit. The third extraction unit is used to extract the color features of the first human body region to obtain the first color features; the fourth extraction unit is used to extract the color features of the second human body region to obtain the second color features.

In some other embodiments, the second extraction module 503 includes a third extraction unit and a fifth extraction unit. The third extraction unit is used to extract the color feature of the first human body region to obtain the first color feature; the fifth extraction unit is used to perform convolution again on the basis of the first color feature to extract the second color feature. In other words, in this embodiment, instead of using the second structural feature to obtain the second color feature, the first color feature is further convolved to obtain the second color feature. Therefore, compared with the first color feature, the second color feature is a higher-level and more global feature.

The detection module 504 is configured to determine the human body detection result of the image to be detected according to the structure feature and the color feature, and the human body detection result includes the body frame and key point information of the image to be detected.

In some embodiments, the detection module 504 includes a connection unit and an activation unit. The connection unit is used to connect the first structural feature and the first color feature to obtain the first connection feature, and is also used to connect the second structural feature and the second color feature to obtain the second connection feature. The activation unit is used to perform activation processing on the first connection feature and the second connection feature based on a preset activation function to obtain the human body detection result of the image to be detected.

Among them, the structural features and color features can be connected by the Concat function. Activation functions include, but are not limited to, Sigmoid functions, Tanh functions, and ReLU functions.
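The connection and activation units can be sketched as below, using concatenation followed by a Sigmoid function. A real model would apply learned weights before producing a body frame and key points; that part is omitted here, and the function names are illustrative.

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    """Sigmoid activation, mapping each value into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def connect_and_activate(structural: np.ndarray, color: np.ndarray):
    """Concatenate a structural feature with a color feature, then activate.

    Both inputs are flattened first so that features of different shapes
    can be joined into a single connection feature vector.
    """
    connected = np.concatenate([np.ravel(structural), np.ravel(color)])
    return connected, sigmoid(connected)
```

Calling the function once with the first structural and color features and once with the second pair yields the first and second connection features described above.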

It should be noted that, in some embodiments, the human body detection device further includes an early warning module. The early warning module includes a posture determination unit and an early warning signal transmitting unit. Specifically, the posture determination unit is used to determine human body posture information according to the body frame and key point information of the image to be detected after the human body detection result of the image to be detected is determined; the early warning signal transmitting unit is used to issue an early warning signal when it is determined, according to the human body posture information, that a preset early warning event has occurred.

For example, in stations, carriages or public places, human body detection is performed on surveillance video to obtain human body detection results. When the human body posture is determined to be a falling posture according to the body frame and key point information, it is known that a person has fallen. An early warning signal can then be sent to staff terminals or broadcast terminals, so that the relevant staff can carry out emergency treatment or start emergency plans in time.
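The fall example can be sketched as a simple rule on the body frame and key points. The rule itself (a body that extends further horizontally than vertically is treated as fallen), the key point names "head" and "foot", and the function names are all hypothetical; the disclosure does not specify how the falling posture is recognized.

```python
from typing import Callable, Dict, Tuple

Box = Tuple[float, float, float, float]  # (top, left, bottom, right) in pixels
Point = Tuple[float, float]              # (x, y) in pixels


def is_fall_posture(box: Box, keypoints: Dict[str, Point]) -> bool:
    """Hypothetical rule: fallen when the body is more horizontal than vertical."""
    if "head" in keypoints and "foot" in keypoints:
        (hx, hy), (fx, fy) = keypoints["head"], keypoints["foot"]
        return abs(hx - fx) > abs(hy - fy)
    # Fall back to the shape of the body frame when key points are missing.
    top, left, bottom, right = box
    return (right - left) > (bottom - top)


def check_and_warn(box: Box, keypoints: Dict[str, Point], send_warning: Callable[[str], None]) -> bool:
    """Send an early warning through send_warning when a fall posture is found."""
    if is_fall_posture(box, keypoints):
        send_warning("fall posture detected")
        return True
    return False
```

Here send_warning stands in for whatever channel delivers the signal to a staff terminal or broadcast terminal.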

It should also be noted that the human body detection device disclosed in this embodiment can be deployed or run in physical servers, virtual servers and various electronic terminals, which is not limited in the present disclosure.

In this embodiment, the first extraction module extracts the structural features of the image to be detected; the area determination module determines the human body area in the image to be detected according to the structural features; the second extraction module extracts the color features of the human body area; and the detection module determines the human body detection result of the image to be detected according to the structural features and the color features. The device uses structural features and color features together to perform human body detection and can obtain human body detection results with high accuracy. In addition, the human body detection model corresponding to the device can be trained using images annotated with human body frames, without using images annotated with key point coordinates, which avoids manual annotation of key point coordinates.

Fig. 6 is a block diagram of an electronic device provided by an embodiment of the present disclosure.

Referring to FIG. 6, an embodiment of the present disclosure provides an electronic device, which includes: at least one processor 601; at least one memory 602; and one or more I/O interfaces 603 connected between the processor 601 and the memory 602; wherein the memory 602 stores one or more computer programs executable by the at least one processor 601, and the one or more computer programs are executed by the at least one processor 601 so that the at least one processor 601 can perform the above human body detection method.

An embodiment of the present disclosure also provides a computer-readable storage medium on which a computer program is stored, wherein the computer program implements the above-mentioned human detection method when executed by a processor/processing core. Computer readable storage media may be volatile or nonvolatile computer readable storage media.

An embodiment of the present disclosure also provides a computer program product, including computer-readable code, or a non-volatile computer-readable storage medium carrying computer-readable code, wherein when the computer-readable code runs in a processor of an electronic device, the processor in the electronic device executes the above human body detection method.

Those of ordinary skill in the art can understand that all or some of the steps in the methods disclosed above, and the functional modules/units in the systems and devices, can be implemented as software, firmware, hardware, and an appropriate combination thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be executed cooperatively by several physical components. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable storage media, which may include computer storage media (or non-transitory media) and communication media (or transitory media).

As known to those of ordinary skill in the art, the term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information (such as computer-readable program instructions, data structures, program modules, or other data). Computer storage media include, but are not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), static random access memory (SRAM), flash memory or other memory technologies, portable compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical disc storage, magnetic cartridges, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store desired information and can be accessed by a computer. In addition, as is well known to those of ordinary skill in the art, communication media typically embody computer-readable program instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery medium.

The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or downloaded to an external computer or external storage device over a network, such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or a network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in each computing/processing device.

Computer program instructions for performing the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it may be connected to an external computer (for example, via the Internet using an Internet service provider). In some embodiments, an electronic circuit, such as a programmable logic circuit, a field programmable gate array (FPGA), or a programmable logic array (PLA), is customized by utilizing state information of the computer-readable program instructions, and the electronic circuit can execute the computer-readable program instructions to implement various aspects of the present disclosure.

The computer program products described here can be specifically realized by means of hardware, software or a combination thereof. In an optional embodiment, the computer program product is embodied as a computer storage medium, and in another optional embodiment, the computer program product is embodied as a software product, such as a software development kit (SDK).

Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.

These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, produce an apparatus for realizing the functions/actions specified in one or more blocks in the flowchart and/or block diagram. These computer-readable program instructions can also be stored in a computer-readable storage medium, and these instructions cause computers, programmable data processing devices and/or other devices to work in a specific way, so that the computer-readable medium storing the instructions includes an article of manufacture comprising instructions for implementing various aspects of the functions/acts specified in one or more blocks in the flowcharts and/or block diagrams.

It is also possible to load computer-readable program instructions into a computer, other programmable data processing device, or other equipment, so that a series of operational steps are performed on the computer, other programmable data processing device, or other equipment to produce a computer-implemented process, so that instructions executed on computers, other programmable data processing devices, or other devices implement the functions/actions specified in one or more blocks in the flowcharts and/or block diagrams.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of an instruction, which includes one or more executable instructions. In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved. It should also be noted that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by a dedicated hardware-based system that performs the specified function or action, or may be implemented by a combination of dedicated hardware and computer instructions.

Example embodiments have been disclosed herein, and while specific terms have been employed, they are used and should be construed in a generic descriptive sense only and not for purposes of limitation. In some instances, it will be apparent to those skilled in the art that features, characteristics and/or elements described in connection with a particular embodiment may be used alone, or may be used in combination with features, characteristics and/or elements described in connection with other embodiments, unless explicitly stated otherwise. Accordingly, it will be understood by those of ordinary skill in the art that various changes in form and details may be made without departing from the scope of the present disclosure as set forth in the appended claims.

Claims

1. A human body detection method, characterized by comprising:

extracting structural features of an image to be detected, where the structural features are used to characterize structural information of the image to be detected;

determining a human body region in the image to be detected according to the structural features;

extracting color features of the human body region, where the color features are used to characterize color information of the human body region;

determining a human body detection result of the image to be detected according to the structural features and the color features, where the human body detection result includes body frame and key point information of the image to be detected.

2. The human body detection method according to claim 1, wherein the structural features include a first structural feature and a second structural feature;

the extracting structural features of the image to be detected includes:

performing feature extraction on the image to be detected based on a preset first convolution kernel to obtain the first structural feature;

performing feature extraction on the first structural feature based on a preset second convolution kernel to obtain the second structural feature.

3. The human body detection method according to claim 2, wherein the first convolution kernel includes a plurality of convolution kernel clusters, and each convolution kernel cluster includes at least one convolution kernel;

the performing feature extraction on the image to be detected based on the preset first convolution kernel to obtain the first structural feature includes:

performing feature extraction on the image to be detected through the plurality of convolution kernel clusters to obtain third structural features;

superimposing the third structural features corresponding to the same convolution kernel cluster to obtain the structural feature corresponding to that convolution kernel cluster, wherein the first structural feature includes the structural features corresponding to the plurality of convolution kernel clusters.

4. The human body detection method according to claim 1, wherein the determining a human body region in the image to be detected according to the structural features includes:

filtering the structural features according to a preset structural feature threshold to obtain filtered structural features, wherein the structural feature threshold is used to filter out background structural features among the structural features;

regressing the filtered structural features to the image to be detected, and determining the human body region in the image to be detected.

5. The human body detection method according to claim 4, wherein the structural feature threshold includes a first structural feature threshold and a second structural feature threshold, and the filtered structural features include a first filtered structural feature and a second filtered structural feature;

the filtering the structural features according to the preset structural feature threshold to obtain the filtered structural features includes:

filtering the first structural feature according to the first structural feature threshold to obtain the first filtered structural feature;

filtering the second structural feature according to the second structural feature threshold to obtain the second filtered structural feature.
The human body detection method according to claim 5, wherein the human body area includes a first human body area and a second human body area, the first human body area is the region obtained by returning the first filtered structural feature to the image to be detected, and the second human body area is the region obtained by returning the second filtered structural feature to the image to be detected;

The extraction of the color features of the human body region includes:

extracting color features of the first human body region to obtain a first color feature;

Extracting color features of the second human body region to obtain second color features.
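As an illustration of per-region color extraction, the sketch below computes one color feature for each of two regions. The claim does not specify the form of the color feature; the per-channel mean of RGB pixels used here is an assumption, as are the name `mean_color` and the `(top, left, bottom, right)` box convention.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]      # (R, G, B)
Image = List[List[Pixel]]
Box = Tuple[int, int, int, int]   # (top, left, bottom, right), bottom/right exclusive


def mean_color(image: Image, box: Box) -> Tuple[float, float, float]:
    """Return the per-channel mean color of the pixels inside `box`.

    Assumption: a mean RGB triple stands in for the color feature.
    """
    top, left, bottom, right = box
    pixels = [image[r][c] for r in range(top, bottom) for c in range(left, right)]
    if not pixels:
        raise ValueError("region is empty")
    n = len(pixels)
    return (
        sum(p[0] for p in pixels) / n,
        sum(p[1] for p in pixels) / n,
        sum(p[2] for p in pixels) / n,
    )


# Hypothetical usage: one color feature per human body region.
image = [
    [(10, 20, 30), (30, 40, 50)],
    [(0, 0, 0), (255, 255, 255)],
]
first_color_feature = mean_color(image, (0, 0, 1, 2))
second_color_feature = mean_color(image, (1, 0, 2, 2))
```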
The human body detection method according to claim 6, wherein the determining the human body detection result of the image to be detected according to the structural features and the color features includes:

connecting the first structural feature and the first color feature to obtain a first connection feature;

connecting the second structural feature and the second color feature to obtain a second connection feature;

Activation processing is performed on the first connection feature and the second connection feature based on a preset activation function to obtain a human body detection result of the image to be detected.
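The connection and activation steps can be sketched as below. The sketch assumes that "connecting" means concatenating feature vectors and uses a sigmoid as a stand-in for the preset activation function, which the claim does not name. It omits the learned layers that a real model would need to turn the activated features into a human body frame and key points. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def connect(structural: Sequence[float], color: Sequence[float]) -> List[float]:
    """Concatenate a structural feature vector with a color feature vector."""
    return list(structural) + list(color)


def sigmoid(x: float) -> float:
    """Numerically stable logistic function (assumed activation)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def activate(first_connection: Sequence[float],
             second_connection: Sequence[float]) -> List[float]:
    """Apply the activation element-wise to both connection features."""
    return [sigmoid(v) for v in list(first_connection) + list(second_connection)]
```

Each activated value lies between 0 and 1, so it can be read as a score by whatever downstream stage produces the detection result.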
The human body detection method according to any one of claims 1-7, wherein the human body detection method is implemented by a preset human body detection model.
The human body detection method according to claim 8, wherein, before the extracting of the structural features of the image to be detected, the method further comprises:

training the human body detection model with a preset training set, wherein the training set includes sample images and human body frame annotation information of the sample images.
A human body detection device, characterized in that it comprises:

The first extraction module is configured to extract structural features of the image to be detected, and the structural features are used to characterize the structural information of the image to be detected;

an area determination module configured to determine the human body area in the image to be detected according to the structural features;

The second extraction module is configured to extract color features of the human body region, where the color features are used to represent color information of the human body region;

The detection module is configured to determine the human body detection result of the image to be detected according to the structural feature and the color feature, and the human body detection result includes a human body frame and key point information of the image to be detected.
An electronic device, comprising a memory, a processor, and a computer program stored on the memory and operable on the processor, characterized in that, when the processor executes the computer program, the human body detection method according to any one of claims 1-9 is implemented.
A computer-readable storage medium, on which a computer program is stored, characterized in that, when the computer program is executed by a processor, the human body detection method according to any one of claims 1-9 is implemented.
A computer program product, comprising computer readable codes, or a non-volatile computer readable storage medium bearing computer readable codes, wherein when the computer readable codes are run in a processor of an electronic device, the processor in the electronic device implements the human body detection method according to any one of claims 1-9.