CN113515990A - Image processing and crowd density estimation method, device and storage medium - Google Patents

Image processing and crowd density estimation method, device and storage medium

Info

Publication number
CN113515990A
CN113515990A
Authority
CN
China
Prior art keywords
image
processed
density
main body
density map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011044576.XA
Other languages
Chinese (zh)
Inventor
颜肇义
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN202011044576.XA
Publication of CN113515990A
Legal status: Pending

Abstract

Embodiments of the present application provide an image processing and crowd density estimation method, a device, and a storage medium. In the embodiments of the application, on one hand, subject identification is performed on an image to be processed to determine the subject region of the image to be processed; on the other hand, subject density estimation is performed on the image to be processed to obtain a density map of the image to be processed; the density map is then filtered based on the subject region of the image to be processed to obtain a filtered density map. This reduces the background noise of the density map and improves the accuracy of the subject density estimation as well as the accuracy of subsequent subject counting based on the density map.

Description

Image processing and crowd density estimation method, device and storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to an image processing method, an image processing apparatus, a crowd density estimation method, and a storage medium.
Background
In recent years, large-scale gatherings such as festival celebrations, concerts, and sporting events have become increasingly frequent, and crowd-related emergencies caused by dense crowds have become a focus of public concern. Crowd counting, as an important means of crowd control and management, can produce statistics on the crowd in the current scene, assist resource allocation, help plan for emergencies, and enhance the safety of public places.
In the prior art, the crowd density map is often summed directly to obtain the final crowd count. However, because of image background noise, the portion of the density map corresponding to the background region may also have strong response values, so the crowd count carries a large error and its accuracy is low.
Disclosure of Invention
Aspects of the present disclosure provide a method, apparatus, and storage medium for image processing and crowd density estimation to reduce background noise of a subject density map, thereby facilitating an improvement in accuracy of subsequent subject counting.
An embodiment of the present application provides an image processing method, including:
acquiring an image to be processed;
carrying out main body density estimation on the image to be processed to obtain a first density map of the image to be processed;
performing subject identification on the image to be processed to determine a subject region of the image to be processed;
and filtering the first density map based on the main body area of the image to be processed to obtain a second density map of the image to be processed.
The embodiment of the present application further provides a crowd density estimation method, including:
acquiring an image to be processed;
performing crowd density estimation on the image to be processed to obtain a first crowd density map;
performing human head recognition on the image to be processed to determine a human head area of the image to be processed;
and filtering the first crowd density map based on the head area of the image to be processed to obtain a second crowd density map.
An embodiment of the present application further provides an image processing method, including:
responding to an image processing request event, and acquiring an image to be processed;
carrying out main body density estimation on the image to be processed to obtain a first density map of the image to be processed;
performing subject identification on the image to be processed to determine a subject region of the image to be processed;
and filtering the first density map based on the main body area of the image to be processed to obtain a second density map of the image to be processed.
An embodiment of the present application further provides an electronic device, including: a memory and a processor; wherein the memory is used for storing a computer program;
the processor is coupled to the memory for executing the computer program for performing the steps in the above-described image processing method and/or crowd density estimation method.
Embodiments of the present application also provide a computer-readable storage medium storing computer instructions, which, when executed by one or more processors, cause the one or more processors to perform the steps of the above-mentioned image processing method and/or crowd density estimation method.
In the embodiments of the application, on one hand, subject identification is performed on the image to be processed to determine its subject region; on the other hand, subject density estimation is performed on the image to be processed to obtain its density map; the density map is then filtered based on the subject region to obtain a filtered density map. This reduces the background noise of the density map and improves the accuracy of the subject density estimation as well as the accuracy of subsequent subject counting based on the density map.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1a is a schematic flowchart of an image processing method according to an embodiment of the present application;
FIG. 1b is a schematic diagram of an image processing process provided in an embodiment of the present application;
fig. 1c is a schematic diagram of an imaging position relationship under a camera collecting view angle according to an embodiment of the present application;
FIG. 1d is a schematic diagram of a model training process provided in the embodiments of the present application;
FIG. 1e is a schematic diagram of a model training process provided in the present application;
fig. 2a is a schematic flowchart of another image processing method according to an embodiment of the present disclosure;
fig. 2b is a schematic flowchart of a crowd density estimation method according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the prior art, the crowd density map is often summed directly to obtain the final crowd count. However, because of image background noise, the portion of the density map corresponding to the background region may also have strong response values, so the crowd count carries a large error and its accuracy is low.
In order to solve the above technical problem, in some embodiments of the present application, on one hand, subject identification is performed on the image to be processed to determine its subject region; on the other hand, subject density estimation is performed on the image to be processed to obtain its density map; the density map is then filtered based on the subject region to obtain a filtered density map. This reduces the background noise of the density map and improves the accuracy of the subject density estimation as well as the accuracy of subsequent subject counting based on the density map.
The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.
It should be noted that: like reference numerals refer to like objects in the following figures and embodiments, and thus, once an object is defined in one figure or embodiment, further discussion thereof is not required in subsequent figures and embodiments.
Fig. 1a is a schematic flowchart of an image processing method according to an embodiment of the present application. As shown in fig. 1a, the method comprises:
101. and acquiring an image to be processed.
102. And carrying out main body density estimation on the image to be processed to obtain a first density map of the image to be processed.
103. And performing subject identification on the image to be processed to determine a subject area of the image to be processed.
104. And filtering the first density map based on the main body area of the image to be processed to obtain a second density map of the image to be processed.
The image processing method provided by this embodiment can be executed by an image acquisition device, an electronic device with an image acquisition function, or a computer device. The image acquisition device can be a camera, a video camera, or the like; the electronic device with an image acquisition function can be an autonomous mobile device or a terminal device equipped with a camera. The autonomous mobile device can be a robot, an unmanned vehicle, an unmanned aerial vehicle, or another device capable of moving autonomously; the terminal device can be a computer, a smartphone, a wearable device, or the like.
The computer equipment can be terminal equipment such as a computer, a smart phone and the like, and can also be server-side equipment. The server device may be a single server device, a cloud server array, or a Virtual Machine (VM) running in the cloud server array.
In the embodiment of the present application, regardless of which device executes the image processing method, the image to be processed can be acquired in step 101. An electronic device with an image acquisition function can acquire the image through its image acquisition component and use the acquired image as the image to be processed. A computer device can communicate with an image acquisition device and receive the image sent by the image acquisition device as the image to be processed. Alternatively, the computer device can read an image from a storage medium as the image to be processed. The storage medium can be a hard disk installed in the computer device, cloud storage, or an external storage medium such as a USB drive.
In order to obtain the subject density of the physical region captured in the image to be processed, in step 102, subject density estimation may be performed on the image to be processed to obtain a density map of the image to be processed. The density map reflects, to some extent, the density of the subjects contained in the image to be processed. The subject object differs across application scenarios. For example, in a supermarket inventory-checking scenario, the image to be processed may be a shelf image and the subject object may be the goods on the shelf; in scenes such as scenic spots, shopping malls, stations, or airports, the image to be processed may be a crowd image and the subject object may be a designated body part, such as a head, an arm, or a leg; in a traffic control scene, the image to be processed may be a road image and the subject object may be a vehicle.
In actual use, subject density estimation carries a certain error: in the density map obtained by density estimation, besides the response values in the subject region, the background region may also have some response values. If the density map obtained by density estimation were used directly as the density estimation result, the subject density would carry a corresponding error. Based on this, in step 103, subject identification may be performed on the image to be processed to determine the subject region of the image to be processed, and the local image corresponding to the subject region is taken as the foreground of the image to be processed, thereby achieving foreground-background segmentation of the image to be processed.
The subject region may be the entire region of the subject object or a local region of the subject object. For example, in an embodiment in which the subject object is a human head, the subject region may be the entire head region, a face region, a partial head region, or the like.
Optionally, a target detection box may also be used to label the subject region of the image to be processed, and the image to be processed labeled with the target detection box may be displayed. In this way, the user can check whether the target detection box meets requirements, for example whether it coincides with the subject of the image to be processed.
Optionally, a target-detection-box adjustment function may also be provided to the user. The computer device, in response to an adjustment operation on the target detection box, acquires the spatial information of the adjusted target detection box and uses the spatial information of the adjusted target detection box as the subject region of the image to be processed. The spatial information of the target detection box reflects the spatial distribution of the target detection box within the image to be processed, and may include the position of the target detection box in the image to be processed, the size of the target detection box, and the like.
In the embodiment of the present application, steps 102 and 103 may be executed in parallel or sequentially. If steps 102 and 103 are executed sequentially, the order of executing steps 102 and 103 is not limited. Fig. 1a illustrates only step 102 and then step 103, but is not limited thereto.
Further, because the subject region is the region in which the subject objects contained in the image to be processed are located, restricting attention to it removes the background portion of the image to be processed. Based on this, in step 104, the density map obtained in step 102 may be filtered based on the subject region of the image to be processed to obtain a filtered density map. In the following embodiments, for convenience of description and distinction, the density map obtained by performing subject density estimation on the image to be processed in step 102 is defined as the first density map, and the density map filtered in step 104 is defined as the second density map.
In this embodiment, because the second density map is filtered, the background noise of the density map is reduced, which further helps to improve the accuracy of the subject density estimation and the accuracy of subject counting based on the second density map.
Further, after the filtered second density map is obtained, the second density map may be summed to obtain the number of subjects contained in the image to be processed. For example, in some embodiments the subject is a human head, and integrating (summing) the second density map yields the number of people contained in the image to be processed. Alternatively, in some embodiments, the physical region captured in the image to be processed may be managed and controlled according to the subject density distribution reflected by the second density map. For example, for an application scenario in which the subject is a human head, people-flow control may be applied to the physical region captured in the image to be processed according to the crowd density distribution reflected by the second density map. In these scenarios, the image acquisition device may be a camera or the like deployed in the physical area.
In the embodiment of the present application, the specific implementation of subject identification and subject density estimation on the image to be processed is not limited. In some embodiments, a neural network model may be employed for subject identification and subject density estimation of the image to be processed. The neural network model may be a Convolutional Neural Network (CNN) model or the like. Further, the convolutional neural network model may be a fully convolutional network (FCN) based on ResNet, MobileNet, ShuffleNet, or the like, but is not limited thereto.
Optionally, the neural networks for performing subject identification and subject density estimation on the image to be processed may be two disjoint branch networks, or two branch networks sharing a backbone network. The backbone network may be VGG16 or a partial network of VGG16; for example, the first 10 convolutional layers of VGG16 followed by 3 dilated convolutional layers may be used. In this embodiment, for convenience of description and distinction, the branch network used for subject identification of the image to be processed is defined as the first branch network, and the branch network used for subject density estimation of the image to be processed is defined as the second branch network.
Further, as shown in fig. 1b, the image to be processed may be input to a backbone network for feature extraction, so as to obtain an initial image feature of the image to be processed; and performing main body recognition on the initial image features by using the first branch network to obtain a main body area of the image to be processed. Accordingly, a second branch network may be utilized to perform a subject density estimation on the initial image features to obtain the first density map.
The following exemplarily describes the specific process of performing subject recognition on the initial image features using the first branch network.
Optionally, the initial image features output by the backbone network may be input into the first branch network. In the first branch network, convolution processing may be performed on the initial image features to obtain first target features of the image to be processed; the first target features characterize whether each position of the image to be processed belongs to the subject or to the background. Further, the pixel points belonging to the subject in the image to be processed can be obtained according to the first target features. Optionally, in the first branch network, the probabilities that a pixel point in the image to be processed belongs to the subject and to the background may be computed from the first target features; for any pixel point, if the probability that it belongs to the subject is greater than the probability that it belongs to the background, the pixel point is determined to belong to the subject; conversely, if the probability that it belongs to the subject is smaller than the probability that it belongs to the background, the pixel point is determined to belong to the background. Further, the subject region of the image to be processed can be determined from the pixel points belonging to the subject in the image to be processed.
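As an illustration of this per-pixel decision (not a required implementation of the embodiment), the subject/background comparison can be sketched as follows, assuming the first branch network outputs a 2-channel map of class scores with channel 0 for background and channel 1 for subject (the channel order is an assumption):

```python
import torch

def subject_pixels_from_logits(branch_logits: torch.Tensor) -> torch.Tensor:
    # branch_logits: (2, H, W) output of the first branch network;
    # channel 0 = background scores, channel 1 = subject scores (assumed order).
    probs = torch.softmax(branch_logits, dim=0)  # per-pixel class probabilities
    return probs[1] > probs[0]                   # True where the subject is more likely

# Hypothetical usage with random scores for a 4x4 image:
# subject_map = subject_pixels_from_logits(torch.randn(2, 4, 4))
```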
For the case where the first branch network and the second branch network share the backbone network, as shown in fig. 1b, the initial image feature may also be input into the second branch network; in a second branch network, carrying out convolution processing on the initial image characteristics to obtain second target characteristics of the image to be processed; wherein the second target feature is a feature manifestation of a subject density of the image to be processed; further, a first density map may be generated based on the second target feature.
After acquiring the main region of the image to be processed and the first density map, the first density map may be filtered based on the main region of the image to be processed to obtain a second density map of the image to be processed. The output of the first branch network is different, and the filtering mode of the first density map based on the main body area of the image to be processed is different.
In some embodiments, the output of the first branch network may be the subject region of the image to be processed. Correspondingly, the region in the first density map corresponding to the subject region can be located according to the position information of the subject region within the image to be processed, and the response values of that region are retained; the response values of the parts of the first density map that do not belong to the subject region are set to 0, which filters the first density map and yields the filtered second density map.
In other embodiments, as shown in FIG. 1b, the output of the first branch network is a mask generated based on the subject region. Accordingly, in the first branch network, a mask of the image to be processed may also be generated based on the subject region. Wherein the region marked 1 in the mask corresponds to the main region of the image to be processed. Optionally, in the first branch network, the pixel value corresponding to the main area of the image to be processed may be set to 1, and the pixel values of the rest of the image to be processed may be set to 0, so as to obtain the mask of the image to be processed. Correspondingly, the mask of the image to be processed can be multiplied by the first density map output by the second branch network, so as to realize the filtering of the first density map and obtain the filtered second density map.
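A minimal sketch of the mask-based filtering described above, assuming the mask and the first density map are available as same-sized arrays (the tensor names are illustrative):

```python
import torch

def filter_density_map(first_density_map: torch.Tensor, subject_mask: torch.Tensor) -> torch.Tensor:
    # first_density_map: (H, W) output of the second branch network.
    # subject_mask: (H, W) values in {0, 1}, with 1 inside the subject region.
    # The element-wise product zeroes out responses falling in the background.
    return first_density_map * subject_mask.float()

# The subject count can then be estimated by summing the filtered map, e.g.:
# count = filter_density_map(density, mask).sum().item()
```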
It is worth noting that in the embodiment of the application, the neural network model can be trained before the neural network model is used for carrying out subject identification and subject density estimation on the image to be processed. Taking the joint training of the backbone network, the first branch network and the second branch network as an example, a training process of the neural network model provided in the embodiment of the present application is exemplarily described below.
In the embodiment of the present application, the network architecture of the initial network model, that is, the network architectures of the initial network branches corresponding to the backbone network, the first branch network, and the second branch network, may be preset. As shown in fig. 1d, the initial network model includes: an initial feature extraction network, an initial segmentation branch network, and an initial density estimation branch network. The initial feature extraction network is the initial network model of the backbone network, the initial segmentation branch network is the initial network model of the first branch network, and the initial density estimation branch network is the initial network model of the second branch network.
The network architecture of the initial network model includes: convolutional layers, pooling layers, the number and order of the convolutional and pooling layers, and the hyper-parameters of each convolutional and pooling layer. The hyper-parameters of a convolutional layer include: the convolution kernel size k (kernel size), the feature-map edge padding size p (padding size), the stride s (stride size), and the number F of output feature maps. The hyper-parameters of a pooling layer are the size K and the stride S of the pooling kernel; and so on.
In the embodiment of the present application, the specific network architectures of the initial density estimation branch network and the initial segmentation branch network are not limited. Optionally, the initial segmentation branch network may include a dilated convolutional layer (dilated conv) with 512-channel input and 256-channel output, a dilated convolutional layer with 256-channel input and 128-channel output, a dilated convolutional layer with 128-channel input and 64-channel output, a convolutional layer (conv) with 64-channel input and 2-channel output, and an activation function layer. The activation function layer classifies the pixel points of the image to be processed; optionally, it may use a softmax function.
Optionally, the initial density estimation branch network may include a dilated convolutional layer with 512-channel input and 256-channel output, a dilated convolutional layer with 256-channel input and 128-channel output, a dilated convolutional layer with 128-channel input and 64-channel output, a convolutional layer with 64-channel input and 1-channel output, and an activation function layer. The activation function layer regresses the density of the pixel points of the image to be processed; optionally, it may use a ReLU function.
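By way of illustration only, the two branch architectures listed above can be sketched as follows; the 3x3 kernel size, dilation rate of 2, and 1x1 output convolutions are assumptions, since the embodiment does not fix them:

```python
import torch
import torch.nn as nn

def dilated_block(in_ch, out_ch, dilation=2):
    # 3x3 dilated convolution followed by ReLU; padding keeps the spatial size.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=dilation, dilation=dilation),
        nn.ReLU(inplace=True),
    )

class SegmentationBranch(nn.Module):
    """Segmentation branch sketch: 512->256->128->64 dilated convs, then a
    2-channel conv; softmax over the 2 channels classifies each pixel."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(dilated_block(512, 256), dilated_block(256, 128),
                                  dilated_block(128, 64), nn.Conv2d(64, 2, kernel_size=1))
    def forward(self, feats):
        return torch.softmax(self.body(feats), dim=1)   # per-pixel subject/background probabilities

class DensityBranch(nn.Module):
    """Density estimation branch sketch: same stack but ending in one channel;
    ReLU keeps the regressed density values non-negative."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(dilated_block(512, 256), dilated_block(256, 128),
                                  dilated_block(128, 64), nn.Conv2d(64, 1, kernel_size=1))
    def forward(self, feats):
        return torch.relu(self.body(feats))             # regressed density map

# Hypothetical usage with 512-channel backbone features for a 60x80 feature map:
# feats = torch.randn(1, 512, 60, 80)
# mask_probs, density = SegmentationBranch()(feats), DensityBranch()(feats)
```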
Wherein the essence of training the initial network model is to train the model parameters to minimize the loss function. In the present example, the initial network models may be jointly trained. Correspondingly, the initial network model can be jointly trained by using the sample image to obtain the backbone network, the first branch network and the second branch network, with the joint loss function minimized as a training target. The sample image can be one frame or multiple frames, the multiple frames refer to 2 frames or more than 2 frames, and the specific values of the number of the multiple frames can be flexibly set according to actual requirements.
The joint loss function may be determined jointly from a cross-entropy function, determined by the mask output during model training and the mask truth map of the sample image, and a mean-square-error function, determined by the product of the density map output during model training (defined as the first training density map) with the mask output during model training, and the density truth map of the sample image. The product of the first training density map and the mask output during model training is the filtered density map output during model training, and is defined as the second training density map.
Further, for the training process of the first branch network (i.e., the segmentation branch network), a mask truth map of the sample image may be obtained. Optionally, the mask truth map of the sample image may be derived from the visibility map of the sample image. Here, the visibility value at a point is the number of pixels with which the camera images an object one meter long at that point; in the same image, positions closer to the camera have larger visibility values. Accordingly, obtaining the mask truth map of the sample image can be implemented as follows: obtain the visibility map of the sample image; perform scale transformation on the visibility map of the sample image using an activation function to determine the subject region of the sample image; and generate the mask truth map of the sample image based on the subject region of the sample image. The activation function may be a tanh function or a sigmoid function.
In this embodiment, the sample image may be an image containing one or more subject objects. Optionally, the sample image may come from an existing image library, for example ShanghaiTech A, ShanghaiTech B, or WorldExpo'10. For images in WorldExpo'10, the visibility maps provided by WorldExpo'10 can be used, i.e., the visibility map of the sample image is read from WorldExpo'10. For sample images in ShanghaiTech A and ShanghaiTech B, a visibility map of the sample image can be obtained in combination with the camera parameters of the sample image. A specific implementation is as follows:
According to the camera viewing angle and the theorem of similar triangles shown in FIG. 1c, it can be derived that:

y_h / z_1 = y_f / C = f / z (1)

where y_h and y_f denote the heights of the head and the feet, respectively, on the image plane as viewed from the camera; z_1 and C denote the actual heights of the head and the feet, respectively, relative to the image plane (so that C - z_1 = H); and z denotes the distance between the person and the camera. Thus, the person's apparent height at the camera viewing angle is:

y_f - y_h = f * H / z (2)

where f denotes the focal length of the camera and H denotes the actual height of the person.
Further, from formula (1) and formula (2), it can be obtained that:

y_f - y_h = (H / C) * y_f (3)

Thus, the visibility value at any point in the visibility map can be defined as:

p = (y_f - y_h) / H = y_f / C (4)

Since C is fixed for each sample image, p is a linear function of the image row, and all pixels in the same row of an image share the same visibility value. To estimate C, the heights h_j of persons at several different positions can be manually annotated in each sample image, and the visibility values p_j at the corresponding positions can then be determined as:

p_j = h_j / H

Alternatively, H may be taken as the average height of an adult, for example H = 1.75 m, so that:

p_j = h_j / 1.75

Therefore, formula (4) can be fitted to these samples by linear regression, which yields the visibility map of the sample image.
After the visibility map of the sample image is obtained, the visibility map may be scale-transformed using an activation function to determine the subject region of the sample image. For example, when the activation function is a sigmoid function, the scale transformation may be expressed as:

p' = a / (1 + exp(-(α * p + β))) (5)

In formula (5), p denotes a visibility value in the visibility map, p' denotes the scaled visibility value, and a, α, and β are scale transformation factors. Optionally, the mean visibility values of any two rows of the sample image may be obtained from its visibility map and substituted into formula (5) so that the sigmoid function equals preset values, from which α and β are solved; the size of a is then adjusted until a detection box matching the size of the subject object is obtained. Specifically, the mean visibility p_0 of the first row and the mean visibility p_n of the bottom row may be computed and substituted into the sigmoid function so that it equals 0.05 and 0.95, respectively, which determines α and β. For ShanghaiTech B, for example, α = 0.0165 and β = 0.48313. The size of a is then adjusted until a detection box matching the size of the subject object is obtained; the region where the detection box is located is the subject region. For example, if the subject object is a human head or a human face, the size of a may be adjusted until a detection box matching the size of a face is obtained; for ShanghaiTech B, a = 24 gives a detection box that fits a face well.
Further, a mask truth map for the sample image may be generated based on the subject region of the sample image. Optionally, the pixel values of the pixel points inside the subject region of the sample image may be set to 1, and the pixel values of the pixel points outside the subject region may be set to 0, so as to obtain the mask truth map of the sample image.
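A numerical sketch of the ground-truth mask generation described above — fitting the per-row visibility from a few annotated person heights and converting it into head-box sizes via the sigmoid scale transformation — is given below. The annotation format (integer (row, col) head coordinates), the square box shape, and the helper names are illustrative assumptions; a, α, and β follow the ShanghaiTech B example values mentioned above, and the sigmoid form is the one reconstructed in formula (5):

```python
import numpy as np

def fit_visibility(rows, heights_px, H=1.75):
    # rows: image rows of a few annotated persons; heights_px: their pixel heights.
    # The visibility (pixels per metre) at each annotation is h_j / H; since visibility
    # is linear in the row index, a first-order polynomial fit covers the whole image.
    p_samples = np.asarray(heights_px, dtype=float) / H
    slope, intercept = np.polyfit(np.asarray(rows, dtype=float), p_samples, deg=1)
    return slope, intercept

def scaled_box_size(p, a=24.0, alpha=0.0165, beta=0.48313):
    # Sigmoid scale transformation of a visibility value into a head-box size in pixels.
    return a / (1.0 + np.exp(-(alpha * p + beta)))

def mask_truth_map(img_h, img_w, head_points, slope, intercept):
    # head_points: assumed (row, col) integer coordinates of annotated heads.
    # Pixels inside a square box of the scaled size around each head are set to 1.
    mask = np.zeros((img_h, img_w), dtype=np.uint8)
    for r, c in head_points:
        half = int(round(scaled_box_size(slope * r + intercept))) // 2
        mask[max(r - half, 0):min(r + half + 1, img_h),
             max(c - half, 0):min(c + half + 1, img_w)] = 1
    return mask
```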
Based on the mask truth map of the sample image, the loss function of the segmentation branch network can be expressed as a cross-entropy function:

l_cls = -(1 / (H * W)) * sum_{i=1..H} sum_{j=1..W} [ M_ij * log(M'_ij) + (1 - M_ij) * log(1 - M'_ij) ] (6)

In formula (6), H denotes the number of pixels of the sample image in the height direction and W denotes the number of pixels in the width direction; M_ij denotes the pixel value of the pixel with coordinates (i, j) in the mask truth map; and M'_ij denotes the pixel value of the pixel with coordinates (i, j) in the mask output by the segmentation branch network during training.
For the second branch network (i.e., the density estimation branch network), the loss function may be a mean-square-error loss function, expressed as:

l_den = (1 / (H * W)) * sum_{i=1..H} sum_{j=1..W} ( D_ij - D'_ij )^2 (7)

where D_ij denotes the density value of the pixel with coordinates (i, j) in the density truth map, and D'_ij denotes the density value of the pixel with coordinates (i, j) in the density map obtained by multiplying the density map output by the density estimation branch network during training with the mask output by the first branch network during training (i.e., the second training density map).
Further, in the embodiment of the present application, in order to jointly train the initial feature extraction network, the initial segmentation branch network, and the initial density estimation branch network, the cross-entropy loss function of the initial segmentation branch network and the mean-square-error loss function of the initial density estimation branch network may be combined. Specifically, the two loss functions may each be multiplied by a corresponding weighting factor and then summed to form the joint loss function, which can be expressed as:

l = l_den + λ * l_cls (8)

In formula (8), λ denotes the weighting factor of the cross-entropy loss function. Optionally, λ = 1e-4.
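For illustration, the joint loss of formula (8) might be computed as sketched below. Using the subject-class probability as a soft mask for the filtered density map (so that gradients propagate through the segmentation branch) is an implementation assumption; the embodiment itself describes multiplying the density map by the mask output of the segmentation branch:

```python
import torch
import torch.nn.functional as F

def joint_loss(pred_mask_probs, pred_density, gt_mask, gt_density, lam=1e-4):
    # pred_mask_probs: (N, 2, H, W) softmax output of the segmentation branch.
    # pred_density: (N, 1, H, W) output of the density estimation branch.
    # gt_mask: (N, H, W) {0, 1} mask truth map; gt_density: (N, 1, H, W) density truth map.
    l_cls = F.nll_loss(torch.log(pred_mask_probs.clamp(min=1e-12)), gt_mask.long())
    # Second training density map: predicted density filtered by the predicted subject mask
    # (here the subject-class probability is used as a soft mask, an assumption for training).
    filtered = pred_density * pred_mask_probs[:, 1:2]
    l_den = F.mse_loss(filtered, gt_density)
    return l_den + lam * l_cls
```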
In order to more clearly illustrate the above-mentioned joint training process, the joint training process provided by the present embodiment is exemplarily illustrated below with reference to fig. 1 e. The main steps of the joint training process are as follows:
s1: and inputting the sample image into the initial feature extraction network as an input image of the initial feature extraction network.
S2: and extracting the initial image characteristics of the sample image by using the initial characteristic extraction network.
S3: and inputting the initial image features into the initial segmentation branch network and the initial density estimation branch network respectively.
S4: and determining a main body area of the sample image by using the initial segmentation branch network.
S5: a mask of the sample image is generated from the subject region of the sample image.
S6: using the initial density estimation branch network, a density map (first training density map) of the sample image is generated.
S7: and calculating the product of the density map of the sample image and the mask output by the initial segmentation branch network to obtain a filtered density map (a second training density map).
S8: and respectively substituting the mask of the initial segmentation branch network, the filtered density map (namely the second training density map) and the mask true value map and the density true value map of the sample image into the joint loss function, and calculating the joint loss function value.
S9: judging whether the training round (epoch) of the sample image reaches a preset round threshold value or not; if the determination result is negative, go to step S10; if the determination result is negative, step S11 is executed.
S10: and adjusting parameters of the initial feature extraction network, the initial segmentation branch network and the initial density map branch network by using the set optimizer logic, taking the adjusted 3 networks as the initial feature extraction network, the initial segmentation branch network and the initial density estimation branch network respectively, and returning to execute the step S1.
Alternatively, the optimizer may employ a Stochastic Gradient Descent (SGD) method or Adam, among others.
S11: and respectively using the current feature extraction network, the current segmentation branch network and the current density estimation branch network as a backbone network, a first branch network and a second branch network.
It should be noted that, in this embodiment, steps S3-S5 and S6 may be executed in parallel or sequentially, and when steps S3-S5 and S6 are executed sequentially, the order of execution of the steps is not limited, and S3-S5 or S6 may be executed first.
It should also be noted that, in the embodiment of the present application, the specific values of the training hyper-parameters are not limited. Optionally, the epoch threshold may be set to 400, the learning rate may be set to 1e-7, and the batch size may be set to 1. An epoch refers to one round of training over every sample image in the sample image set; the batch size refers to the number of sample images input at a time during model training.
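Under the assumptions that the three networks are available as trainable modules, that a data loader yields sample images with their mask and density truth maps, and reusing the joint_loss sketch above, the training loop S1-S10 might be sketched as follows (the module and loader names are placeholders; the epoch threshold, learning rate, and batch size follow the values given above):

```python
import torch

def train(backbone, seg_branch, den_branch, loader, joint_loss, epochs=400, lr=1e-7):
    # backbone / seg_branch / den_branch: hypothetical nn.Module instances for the
    # initial feature extraction, segmentation, and density estimation networks.
    # loader yields (image, gt_mask, gt_density) with batch size 1.
    params = list(backbone.parameters()) + list(seg_branch.parameters()) + list(den_branch.parameters())
    opt = torch.optim.SGD(params, lr=lr)              # Adam is an equally valid choice per the text
    for epoch in range(epochs):                       # S9: stop once the epoch threshold is reached
        for image, gt_mask, gt_density in loader:
            feats = backbone(image)                   # S1-S2: initial image features
            mask_probs = seg_branch(feats)            # S3-S5: subject region / mask
            density = den_branch(feats)               # S6: first training density map
            loss = joint_loss(mask_probs, density, gt_mask, gt_density)  # S7-S8
            opt.zero_grad()
            loss.backward()
            opt.step()                                # S10: parameter adjustment
    return backbone, seg_branch, den_branch           # S11: final networks
```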
In addition, the image processing method provided by the embodiment of the application can be deployed on any computer equipment. Optionally, the image processing method provided by the embodiment of the application can be deployed in a cloud to serve as an SaaS service. For the server device with the SaaS service deployed, the steps in the image processing method may be executed in response to a service request of other client devices. As shown in fig. 2a, the method mainly includes:
20a, responding to the image processing request event, and acquiring the image to be processed.
And 20b, carrying out main body density estimation on the image to be processed to obtain a first density map of the image to be processed.
And 20c, performing subject identification on the image to be processed to determine a subject area of the image to be processed.
And 20d, filtering the first density map based on the main body area of the image to be processed to obtain a second density map of the image to be processed.
The image processing method provided by this embodiment can be deployed in the cloud to provide an image processing service for users. The server device on which the image processing method is deployed can acquire the image to be processed in response to an image processing request event. Optionally, the server device may expose an Application Programming Interface (API) to callers, and a service requester may call the API to invoke the image processing service; in this case, the image processing request event is the call event generated by calling the API. The service requester and the server device can be communicatively connected via a wireless or wired connection. The wireless connection may be a mobile-network connection, and the network format of the mobile network may be any of 2G (GSM), 2.5G (GPRS), 3G (WCDMA, TD-SCDMA, CDMA2000, UMTS), 4G (LTE), 4G+ (LTE+), 5G, WiMax, and the like. Optionally, the service requester and the server device may also communicate via Bluetooth, WiFi, infrared, or the like. Alternatively, the service requester may invoke the image processing service through Remote Procedure Call (RPC) or Remote Direct Memory Access (RDMA) technology.
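Purely to illustrate the request/response pattern of such a SaaS invocation, a client call might look like the sketch below; the endpoint URL, field names, and response format are invented for the example and are not defined by the embodiment:

```python
import requests

# Hypothetical endpoint and payload layout; the real API is defined by the service provider.
API_URL = "https://example.com/api/v1/density-estimation"

def request_density_estimation(image_path: str) -> dict:
    with open(image_path, "rb") as f:
        resp = requests.post(API_URL, files={"image": f}, timeout=30)
    resp.raise_for_status()
    return resp.json()  # e.g. a second density map and/or an estimated subject count
```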
For the description of steps 20b-20c, reference is made to fig. 1a and its related contents in the alternative embodiment. Further, after obtaining the second density map of the image to be processed, the second density map may also be provided to the requester who initiated the image processing request event. For the requestor, a second density map may be received and summed to determine the number of subjects contained in the image to be processed. Optionally, the requesting party may further manage and control the physical area acquired by the image to be processed according to the main density distribution reflected by the second density map.
Optionally, in some embodiments, the second density map may also be summed to determine the number of subjects contained in the image to be processed; the number of subjects contained in the image to be processed is provided to the requestor who initiated the image processing request event. Accordingly, the requester can output the number of subjects included in the image to be processed.
In other embodiments, a control strategy for a physical region acquired by the image to be processed may be further determined according to a subject density distribution condition reflected by the second density map; and provides the governing policy to the requestor initiating the image processing request event. Accordingly, the requestor may output the governing policy.
Accordingly, embodiments of the present application also provide a computer readable storage medium storing computer instructions, which, when executed by one or more processors, cause the one or more processors to execute the steps in the image processing method.
The image processing method provided by the embodiments of the application can be applied to any application scenario requiring subject density estimation. In some application scenarios, the image to be processed is a shelf image containing at least one commodity. Using the image processing method provided by the embodiments, the commodity density of the shelf image can be estimated and summed to obtain the number of commodities contained in the shelf image, so as to check the goods. In a traffic control scenario, the image to be processed may be a road image containing at least one vehicle; using the image processing method provided by the embodiments, the density of vehicles on the road can be estimated and summed to obtain the traffic flow on the road, and traffic control can then be performed on the road according to the traffic flow. In other application scenarios, the image to be processed is a crowd image containing at least one human head, and the crowd density can be estimated using the image processing method provided by the embodiments; and so on. The following exemplarily describes the image processing method as applied to crowd density estimation.
Fig. 2b is a schematic flow chart of a crowd density estimation method according to an embodiment of the present disclosure. As shown in fig. 2b, the method comprises:
201. and acquiring an image to be processed.
202. And carrying out crowd density estimation on the image to be processed to obtain a first crowd density map.
203. And performing human head recognition on the image to be processed to determine a human head area of the image to be processed.
204. And filtering the first crowd density map based on the head area of the image to be processed to obtain a second crowd density map.
In this embodiment, for a description of an image capturing device for an image to be processed and an execution subject of the crowd density estimation method, reference may be made to relevant contents of the above-mentioned image processing method embodiment, and details are not repeated herein.
In this embodiment, the crowd density estimation device can acquire the image to be processed in step 201. In order to obtain the crowd density of the physical region captured in the image to be processed, in step 202, crowd density estimation may be performed on the image to be processed to obtain a crowd density map of the image to be processed.
In actual use, crowd density estimation carries a certain error: in the density map obtained by density estimation, besides the response values in the head regions, the background region may also have some response values. If the crowd density map obtained by density estimation were used directly as the density estimation result, the crowd density would carry a corresponding error. Based on this, in step 203, head recognition may be performed on the image to be processed to determine the head region of the image to be processed, and the local image corresponding to the head region is taken as the foreground of the image to be processed, thereby achieving foreground-background segmentation of the image to be processed.
In the embodiment of the present application, steps 202 and 203 may be executed in parallel or sequentially. If steps 202 and 203 are executed sequentially, the order of executing steps 202 and 203 is not limited. Fig. 2b illustrates only step 202 being executed first and then step 203 being executed, but the present invention is not limited thereto.
Further, because the head region is the region in which the heads contained in the image to be processed are located, restricting attention to it removes the background portion of the image to be processed. Based on this, in step 204, the crowd density map obtained in step 202 may be filtered based on the head region of the image to be processed to obtain a filtered crowd density map. In the following embodiments, for convenience of description and distinction, the crowd density map obtained by performing crowd density estimation on the image to be processed in step 202 is defined as the first crowd density map, and the density map filtered in step 204 is defined as the second crowd density map.
In this embodiment, because the second crowd density map is filtered, the background noise of the crowd density map is reduced, which further helps to improve the accuracy of crowd density estimation and the accuracy of subsequent crowd counting based on the second crowd density map.
Further, after the filtered second crowd density map is obtained, the second crowd density map may be summed to obtain the number of crowd included in the image to be processed. Or, in some embodiments, the stream of people may be managed and controlled for the physical region where the image to be processed is acquired according to the crowd density distribution reflected by the second crowd density map, and the like. In these scenarios, the image capture device may be a camera or the like deployed in a physical area.
In some embodiments, the image to be processed may be input to a backbone network for feature extraction to obtain an initial image feature of the image to be processed; and performing human head recognition on the initial image characteristics by using the first branch network to obtain a human head area of the image to be processed. Accordingly, a second branch network may be utilized to perform crowd density estimation on the initial image features to obtain the first crowd density map.
Optionally, the initial image features output by the backbone network may be input into the first branch network; in the first branch network, convolution processing can be carried out on the initial image characteristics to obtain first target characteristics of the image to be processed; the first target feature is the feature representation that the image to be processed is a human head or a background. Further, pixel points belonging to the head in the image to be processed can be obtained according to the first target characteristic. Optionally, in the first branch network, the probability that a pixel point in the image to be processed belongs to the human head and the background can be calculated according to the first target feature; for any pixel point, if the probability that the pixel point belongs to the head is greater than the probability that the pixel point belongs to the background, determining that the pixel point belongs to the head; correspondingly, if the probability that the pixel point belongs to the head is smaller than the probability that the pixel point belongs to the background, the pixel point is determined to belong to the background. Further, the head region of the image to be processed can be determined according to the pixel points belonging to the head in the image to be processed.
For the case where the first branch network and the second branch network share the backbone network, as shown in fig. 1b above, the initial image feature may also be input into the second branch network; in a second branch network, carrying out convolution processing on the initial image characteristics to obtain second target characteristics of the image to be processed; the second target feature is the feature embodiment of the crowd density of the image to be processed; further, a first population density map may be generated based on the second target feature.
After the head region and the first crowd density map of the image to be processed are obtained, the first crowd density map can be filtered based on the head region of the image to be processed to obtain a second crowd density map of the image to be processed. The output of the first branch network is different, and the filtering mode of the first crowd density map based on the head area of the image to be processed is different.
In some embodiments, the output of the first branch network may be the head region of the image to be processed. Correspondingly, an area corresponding to the head area in the first crowd density map can be obtained according to the position information of the head area of the image to be processed in the image to be processed, and the response value of the area is reserved; and setting the response value of the part, which does not belong to the head area, in the first crowd density map to be 0, so as to realize filtering of the first crowd density map and obtain a filtered second crowd density map.
In other embodiments, as shown in FIG. 1b, the output of the first branch network is a mask generated based on the head region. Accordingly, in the first branch network, a mask of the image to be processed may also be generated based on the head region. Wherein the region marked 1 in the mask corresponds to the head region of the image to be processed. Optionally, in the first branch network, the pixel value corresponding to the head region of the image to be processed may be set to 1, and the pixel values of the rest of the image to be processed may be set to 0, so as to obtain the mask of the image to be processed. Correspondingly, the mask of the image to be processed can be multiplied by the first crowd density map output by the second branch network, so that the first crowd density map is filtered, and the filtered second crowd density map is obtained.
It should be noted that, in the embodiment of the present application, the neural network model may be trained before performing the head recognition and the crowd density estimation on the image to be processed by using the neural network model. For the model training process, reference may be made to the relevant contents of the above embodiment of the image processing method, and details are not repeated here.
In order to verify the accuracy of the crowd density estimation method provided by the embodiments of the application, the applicant used the crowd density estimation method provided by the embodiments and a conventional crowd density estimation method to estimate the crowd density of the images in ShanghaiTech A, ShanghaiTech B, and WorldExpo'10, obtained crowd density maps of the images, and counted the crowds in the images according to the crowd density maps. The resulting crowd counting errors are shown in Table 1 below:
TABLE 1. Crowd counting error (MAE)

                                  ShanghaiTech A   ShanghaiTech B   WorldExpo'10
Reference MAE                          68.2             10.6            8.6
MAE of the present application         60.7              7.3            8.0
The crowd counting error in Table 1 refers to the mean absolute error (MAE) of the crowd count in each image. The reference MAE is the mean absolute error produced by counting the crowds according to the crowd density maps obtained by estimating the crowd density of the images in the 3 image libraries with a conventional crowd density estimation method. Table 1 shows that the crowd density estimation method provided by the embodiments of the application produces a smaller crowd counting error than the conventional crowd density estimation method, i.e., it helps to improve the accuracy of crowd counting.
It is further worth noting that the crowd density estimation method provided by the embodiments of the application can be deployed on any computer device. Optionally, it can be deployed in the cloud as a SaaS service. The server device on which the SaaS service is deployed can execute the steps of the crowd density estimation method in response to service requests from client devices and return the computed second crowd density map to the client device; alternatively, it can provide the client device with the number of people contained in the image to be processed, obtained from the second crowd density map, or with people-flow control measures for the physical region captured in the image to be processed, determined from the crowd density distribution reflected by the second crowd density map; and so on.
Accordingly, embodiments of the present application also provide a computer-readable storage medium storing computer instructions, which, when executed by one or more processors, cause the one or more processors to perform the steps of the crowd density estimation method described above.
It should be noted that the execution subjects of the steps of the methods provided in the above embodiments may be the same device, or different devices may be used as the execution subjects of the methods. For example, the execution subjects of steps 201 and 202 may be device a; for another example, the execution subject of step 201 may be device a, and the execution subject of step 202 may be device B; and so on.
In addition, in some of the flows described in the above embodiments and the drawings, a plurality of operations are included in a specific order, but it should be clearly understood that the operations may be executed out of the order presented herein or in parallel, and the sequence numbers of the operations, such as 201, 202, etc., are merely used for distinguishing different operations, and the sequence numbers do not represent any execution order per se. Additionally, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel.
Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in Fig. 3, the electronic device includes a memory 30a and a processor 30b. The memory 30a is used for storing a computer program.
The processor 30b is coupled to the memory 30a and executes the computer program to: acquire an image to be processed; perform subject density estimation on the image to be processed to obtain a first density map of the image to be processed; perform subject identification on the image to be processed to determine a subject region of the image to be processed; and filter the first density map based on the subject region of the image to be processed to obtain a second density map of the image to be processed.
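A minimal sketch of this processing flow, assuming placeholder functions `estimate_density` and `segment_subject` for the two branch networks (these names are illustrative, not from the disclosure):

```python
import numpy as np

def process_image(image: np.ndarray,
                  segment_subject,      # first branch: returns a binary subject mask (H, W)
                  estimate_density):    # second branch: returns a density map (H, W)
    """Estimate density, identify the subject region, then filter the density map
    with the subject mask so that background responses are suppressed."""
    first_density_map = estimate_density(image)            # subject density estimation
    subject_mask = segment_subject(image)                   # 1 = subject, 0 = background
    second_density_map = first_density_map * subject_mask   # filtered density map
    return second_density_map

# The subject count is then obtained by summing the filtered density map:
# count = second_density_map.sum()
```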
In some embodiments, when performing subject identification on the image to be processed, the processor 30b is specifically configured to: input the image to be processed into a backbone network for feature extraction to obtain initial image features of the image to be processed; and perform subject identification on the initial image features by means of a first branch network to obtain the subject region of the image to be processed.
Further, when performing subject identification on the initial image features by means of the first branch network, the processor 30b is specifically configured to: input the initial image features into the first branch network; perform convolution processing on the initial image features in the first branch network to obtain a first target feature of the image to be processed, the first target feature being a feature representation of whether each part of the image to be processed is subject or background; obtain the pixel points belonging to the subject in the image to be processed according to the first target feature; and determine the subject region of the image to be processed from the pixel points belonging to the subject.
Optionally, when obtaining the pixel points belonging to the subject in the image to be processed, the processor 30b is specifically configured to: calculate, from the first target feature, the probabilities that each pixel point in the image to be processed belongs to the subject and to the background; and, for any pixel point, determine that it belongs to the subject if its subject probability is greater than its background probability.
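Assuming the first target feature is a two-channel subject/background logit map, that per-pixel decision can be sketched as follows (PyTorch is used here only for illustration):

```python
import torch

def subject_pixels(first_target_feature: torch.Tensor) -> torch.Tensor:
    """first_target_feature: (1, 2, H, W) logits, channel 0 = background, channel 1 = subject.
    Returns a boolean (H, W) map marking pixels whose subject probability exceeds
    their background probability."""
    probs = torch.softmax(first_target_feature, dim=1)
    return probs[0, 1] > probs[0, 0]
```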
In other embodiments, when performing subject density estimation on the image to be processed, the processor 30b is specifically configured to perform subject density estimation on the initial image features by using the second branch network to obtain the first density map.
Further, when performing subject density estimation on the initial image features by using the second branch network, the processor 30b is specifically configured to: input the initial image features into the second branch network; perform convolution processing on the initial image features in the second branch network to obtain a second target feature of the image to be processed, the second target feature being a feature representation of the subject density of the image to be processed; and generate the first density map based on the second target feature.
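The backbone plus two branch networks can be sketched along the following lines; this is a minimal sketch, and the layer types and channel counts are illustrative assumptions, not values taken from the disclosure:

```python
import torch
import torch.nn as nn

class TwoBranchCrowdNet(nn.Module):
    """Sketch: a shared backbone feeds a segmentation branch (subject/background)
    and a density-estimation branch."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(             # feature extraction (initial image features)
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.seg_branch = nn.Conv2d(64, 2, 1)      # first branch: subject/background logits
        self.density_branch = nn.Conv2d(64, 1, 1)  # second branch: density map

    def forward(self, x):
        feats = self.backbone(x)                   # initial image features
        seg_logits = self.seg_branch(feats)        # first target feature
        density = self.density_branch(feats)       # first density map
        return seg_logits, density
```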
Accordingly, when filtering the first density map, the processor 30b is specifically configured to: generate a mask of the image to be processed based on the subject region output by the first branch network, where the region marked as 1 in the mask corresponds to the subject region; and multiply the mask by the first density map to obtain the second density map.
Optionally, before generating the mask of the image to be processed, the processor 30b is further configured to: jointly train an initial network model on sample images, with minimization of a joint loss function as the training target, to obtain the backbone network, the first branch network and the second branch network. The initial network model comprises an initial feature extraction network, an initial segmentation branch network and an initial density estimation branch network. The joint loss function is determined jointly by a cross-entropy term computed between the mask output during training and the mask truth map of the sample image, and a mean-square-error term computed between the density truth map of the sample image and the product of the density map output during training and that mask.
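One natural reading of that joint loss, written with $\hat{M}$ for the mask output by training, $M$ for the mask truth map, $\hat{D}$ for the density map output by training, $D$ for the density truth map and $\lambda$ for a weighting factor (the weighting factor is an assumption; the disclosure does not state how the two terms are balanced), is:

$$\mathcal{L} \;=\; \mathrm{CE}\!\left(\hat{M},\,M\right) \;+\; \lambda\,\bigl\lVert \hat{D}\odot\hat{M} \;-\; D \bigr\rVert_2^2$$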
Optionally, the processor 30b is further configured to: obtain a visibility map of the sample image; scale the visibility map of the sample image with an excitation function to determine the subject region of the sample image; and generate the mask truth map based on the subject region of the sample image. The excitation function is a tanh function or a sigmoid function.
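A minimal sketch of that construction, assuming the visibility map is a non-negative per-pixel score and that the scaled map is binarized at a fixed threshold of 0.5 (the thresholding step and its value are assumptions, not stated in the disclosure):

```python
import numpy as np

def mask_truth_map(visibility_map: np.ndarray, activation: str = "tanh") -> np.ndarray:
    """Scale the visibility map into a bounded range with an excitation function,
    then binarize it to obtain the mask truth map of the sample image."""
    if activation == "tanh":
        scaled = np.tanh(visibility_map)
    else:  # sigmoid
        scaled = 1.0 / (1.0 + np.exp(-visibility_map))
    return (scaled > 0.5).astype(np.float32)  # 1 = subject region, 0 = background
```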
In other embodiments, the processor 30b is further configured to: sum the second density map to determine the number of subjects contained in the image to be processed; and/or manage and control the physical region captured in the image to be processed according to the subject density distribution reflected by the second density map.
In this embodiment of the present application, optionally, before filtering the first density map, the processor 30b is further configured to: mark the subject region of the image to be processed with a target detection frame; the display component 30e then displays the image to be processed with the target detection frame marked on it, so that the user can check whether the target detection frame meets the requirement.
Further, the processor 30b is further configured to: in response to an adjustment operation on the target detection frame, obtain the spatial information of the adjusted target detection frame, and take the spatial information of the adjusted target detection frame as the subject region of the image to be processed.
In an embodiment of the present application, the subject may be a human head. Correspondingly, the processor 30b is further configured to: acquire an image to be processed; perform crowd density estimation on the image to be processed to obtain a first crowd density map; perform human head recognition on the image to be processed to determine the head region of the image to be processed; and filter the first crowd density map based on the head region of the image to be processed to obtain a second crowd density map.
Optionally, the processor 30b is further configured to: sum the second crowd density map to determine the number of people included in the image to be processed; and/or carry out people-flow control on the physical region captured in the image to be processed according to the crowd density distribution reflected by the second crowd density map.
In some embodiments, the electronic device is a server-side device. The processor 30b is further configured to: acquire an image to be processed in response to an image processing request event; perform subject density estimation on the image to be processed to obtain a first density map of the image to be processed; perform subject identification on the image to be processed to determine a subject region of the image to be processed; and filter the first density map based on the subject region of the image to be processed to obtain a second density map of the image to be processed.
Optionally, the processor 30b is further configured to: provide the second density map, via the communication component 30c, to the requester that initiated the image processing request event; and/or sum the second density map to determine the number of subjects contained in the image to be processed, and provide that number to the requester via the communication component 30c; and/or determine a management and control strategy for the physical region captured in the image to be processed according to the subject density distribution reflected by the second density map, and provide that strategy to the requester via the communication component 30c.
In some optional embodiments, as shown in fig. 3, the electronic device may further include: power supply component 30d and audio component 30 f. Only some of the components are schematically shown in fig. 3, and it is not meant that the electronic device must include all of the components shown in fig. 3, nor that the electronic device only includes the components shown in fig. 3.
It should be noted that the electronic device provided in this embodiment may be an image acquisition device such as a camera or a video camera, a terminal device such as a computer or a smartphone, a server-side device such as a server or a cloud server array, or an autonomous mobile device such as a robot, an unmanned vehicle or an unmanned aerial vehicle, and so on. Different implementation forms of the electronic device may include different functional components; for example, an autonomous mobile device may also include a drive assembly, an image capture assembly, and the like.
In embodiments of the present application, the memory is used to store computer programs and may be configured to store other various data to support operations on the device on which it is located. Wherein the processor may execute a computer program stored in the memory to implement the corresponding control logic. The memory may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
In the embodiments of the present application, the processor may be any hardware processing device that can execute the above-described method logic. Optionally, the processor may be a Central Processing Unit (CPU), a Graphics Processing Unit (GPU) or a Micro Controller Unit (MCU); a programmable device such as a Field-Programmable Gate Array (FPGA), a Programmable Array Logic device (PAL), a General Array Logic device (GAL) or a Complex Programmable Logic Device (CPLD); an Advanced RISC Machine (ARM) processor; or a System on Chip (SoC); but is not limited thereto.
In embodiments of the present application, the communication component is configured to facilitate wired or wireless communication between the device in which it is located and other devices. The device in which the communication component is located can access a wireless network based on a communication standard, such as WiFi, 2G or 3G, 4G, 5G or a combination thereof. In an exemplary embodiment, the communication component receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component may also be implemented based on Near Field Communication (NFC) technology, Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, or other technologies.
In the embodiment of the present application, the display assembly may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the display assembly includes a touch panel, the display assembly may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
In embodiments of the present application, a power supply component is configured to provide power to various components of the device in which it is located. The power components may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device in which the power component is located.
In embodiments of the present application, the audio component may be configured to output and/or input audio signals. For example, the audio component includes a Microphone (MIC) configured to receive an external audio signal when the device in which the audio component is located is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in a memory or transmitted via a communication component. In some embodiments, the audio assembly further comprises a speaker for outputting audio signals. For example, for devices with language interaction functionality, voice interaction with a user may be enabled through an audio component, and so forth.
In the electronic device provided by this embodiment, on one hand, subject identification is performed on the image to be processed to determine its subject region; on the other hand, subject density estimation is performed on the image to be processed to obtain a density map. The density map is then filtered based on the subject region, which reduces the background noise of the density map and thereby improves the accuracy of subject density estimation and of any subsequent subject counting performed on the density map.
It should be noted that, the descriptions of "first", "second", etc. in this document are used for distinguishing different messages, devices, modules, etc., and do not represent a sequential order, nor limit the types of "first" and "second" to be different.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises that element.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (20)

1. An image processing method, comprising:
acquiring an image to be processed;
carrying out main body density estimation on the image to be processed to obtain a first density map of the image to be processed;
performing subject identification on the image to be processed to determine a subject region of the image to be processed;
and filtering the first density map based on the subject region of the image to be processed to obtain a second density map of the image to be processed.
2. The method according to claim 1, wherein the performing subject recognition on the image to be processed to determine a subject region of the image to be processed comprises:
inputting the image to be processed into a backbone network for feature extraction to obtain initial image features of the image to be processed;
and performing subject identification on the initial image features by utilizing a first branch network to obtain a subject region of the image to be processed.
3. The method according to claim 2, wherein the performing subject recognition on the initial image feature by using the first branch network to obtain a subject region of the image to be processed comprises:
inputting the initial image features into a first branch network;
in the first branch network, performing convolution processing on the initial image feature to obtain a first target feature of the image to be processed; wherein the first target feature is a feature embodiment in which the image to be processed is a subject or a background;
acquiring pixel points belonging to the subject in the image to be processed according to the first target feature;
and determining a subject region of the image to be processed according to the pixel points belonging to the subject in the image to be processed.
4. The method according to claim 3, wherein the obtaining of the pixel point belonging to the subject in the image to be processed according to the first target feature comprises:
calculating, according to the first target feature, the probabilities that pixel points in the image to be processed belong to the subject and to the background;
and, for any pixel point, if the probability that the pixel point belongs to the subject is greater than the probability that the pixel point belongs to the background, determining that the pixel point belongs to the subject.
5. The method according to claim 2, wherein the performing the subject density estimation on the image to be processed to obtain a first density map of the image to be processed comprises:
and performing subject density estimation on the initial image features by using a second branch network to obtain the first density map.
6. The method of claim 5, wherein performing a subject density estimation on the initial image feature using a second branch network to obtain the first density map comprises:
inputting the initial image features into the second branch network;
in the second branch network, performing convolution processing on the initial image features to obtain second target features of the image to be processed; wherein the second target feature is a feature manifestation of a subject density of the image to be processed;
and generating the first density map according to the second target characteristic.
7. The method according to claim 6, wherein the filtering the first density map based on the subject region of the image to be processed to obtain a second density map of the image to be processed comprises:
generating, in the first branch network, a mask of the image to be processed based on the subject region of the image to be processed; wherein the region marked as 1 in the mask of the image to be processed corresponds to the subject region;
and multiplying the mask by the first density map to obtain the second density map.
8. The method of claim 7, further comprising, before generating the mask of the image to be processed based on the subject region of the image to be processed:
performing joint training on an initial network model by using sample images, with minimization of a joint loss function as the training target, to obtain the backbone network, the first branch network and the second branch network; wherein the initial network model comprises an initial feature extraction network, an initial segmentation branch network and an initial density estimation branch network;
the joint loss function is determined jointly by a cross-entropy function computed between the mask output by model training and the mask truth map of the sample image, and a mean-square-error function computed between the density truth map of the sample image and the product of the density map output by model training and the mask output by model training.
9. The method of claim 8, further comprising:
acquiring a visibility map of the sample image;
carrying out scale transformation on the visibility map of the sample image by utilizing an excitation function so as to determine a subject region of the sample image;
generating the mask truth map based on a subject region of the sample image.
10. The method of claim 9, wherein the excitation function is a tanh function or a sigmoid function.
11. The method of any of claims 1-9, further comprising, prior to filtering the first density map based on the subject region:
marking a subject region of the image to be processed by adopting a target detection frame;
and displaying the image to be processed marked with the target detection frame so that a user can check whether the target detection frame meets the requirement.
12. The method of claim 11, further comprising:
in response to an adjustment operation on the target detection frame, acquiring spatial information of the adjusted target detection frame; and using the spatial information of the adjusted target detection frame as a subject region of the image to be processed.
13. The method of any one of claims 1-10, further comprising:
summing the second density map to determine the number of subjects contained in the image to be processed;
and/or,
managing and controlling the physical region captured in the image to be processed according to the subject density distribution reflected by the second density map.
14. The method of claim 13, wherein the subject is a human head.
15. A method of crowd density estimation, comprising:
acquiring an image to be processed;
performing crowd density estimation on the image to be processed to obtain a first crowd density map;
performing human head recognition on the image to be processed to determine a human head area of the image to be processed;
and filtering the first crowd density map based on the head area of the image to be processed to obtain a second crowd density map.
16. The method of claim 15, comprising:
summing the second crowd density map to determine the number of people included in the image to be processed;
and/or,
carrying out people-flow control on the physical region captured in the image to be processed according to the crowd density distribution reflected by the second crowd density map.
17. An image processing method, comprising:
responding to an image processing request event, and acquiring an image to be processed;
carrying out subject density estimation on the image to be processed to obtain a first density map of the image to be processed;
performing subject identification on the image to be processed to determine a subject region of the image to be processed;
and filtering the first density map based on the subject region of the image to be processed to obtain a second density map of the image to be processed.
18. The method of claim 17, further comprising:
providing the second density map to the requester that initiated the image processing request event;
and/or,
summing the second density map to determine the number of subjects contained in the image to be processed, and providing the number of subjects contained in the image to be processed to the requester that initiated the image processing request event;
and/or,
determining a management and control strategy for the physical region captured in the image to be processed according to the subject density distribution reflected by the second density map, and providing the management and control strategy to the requester that initiated the image processing request event.
19. An electronic device, comprising: a memory and a processor; wherein the memory is used for storing a computer program;
the processor is coupled to the memory for executing the computer program for performing the steps of the method of any of claims 1-18.
20. A computer-readable storage medium having stored thereon computer instructions, which, when executed by one or more processors, cause the one or more processors to perform the steps of the method of any one of claims 1-18.

Application publication date: 20211019