CN112115900B - Image processing method, device, equipment and storage medium


Info

Publication number
CN112115900B
Authority
CN
China
Prior art keywords
image
processed
density
sample
different
Prior art date
Legal status
Active
Application number
CN202011020045.7A
Other languages
Chinese (zh)
Other versions
CN112115900A (en)
Inventor
王昌安
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202011020045.7A
Publication of CN112115900A
Application granted
Publication of CN112115900B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V 20/53 Recognition of crowd images, e.g. recognition of crowd congestion
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features


Abstract

The application discloses an image processing method, an image processing device, image processing equipment and a storage medium, and belongs to the technical field of artificial intelligence. According to the embodiments of the application, it is considered that, because different image areas in an image are at different distances, the sizes of people's heads in those areas may differ. For each image area in the image, after the image features are extracted, the crowd density is obtained under different receptive fields; the distance of each image area is then determined by analyzing the different image areas in the image, and a suitable receptive field is dynamically selected to adjust the crowd density of that image area. The crowd density obtained in this way better matches the distance of the image area, which improves the precision of crowd density acquisition and, in turn, the accuracy of the number of people.

Description

Image processing method, device, equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to an image processing method, apparatus, device, and storage medium.
Background
With the development of artificial intelligence technology, more and more fields realize intelligent data processing by using computers to simulate, extend and expand human intelligence, so that data can be processed automatically instead of manually, which improves data processing efficiency.
An image processing method based on artificial intelligence is one application of artificial intelligence technology. In one scenario, an image may be processed to determine the crowd density in the image, and thus the number of people in the image. Currently, an image processing method generally performs feature extraction on an image, predicts a density map of the image based on the extracted image features, and determines the number of people based on the density map.
Because different image areas in an image are at different distances, the sizes of people's heads differ; the above image processing method does not take this into account, and the accuracy of the predicted density map is therefore low.
Disclosure of Invention
The embodiments of the application provide an image processing method, an image processing device, image processing equipment and a storage medium, which can improve the precision and accuracy of crowd density estimation.
In one aspect, there is provided an image processing method, the method including:
Acquiring an image to be processed;
Extracting features of the image to be processed to obtain depth features of the image to be processed;
Based on at least two receptive fields and the depth features, crowd densities of different image areas in the image to be processed are obtained, and at least two first density maps corresponding to the at least two receptive fields are obtained, wherein the at least two receptive fields are different from each other;
acquiring the matching degree between different image areas in the image to be processed and the at least two receptive fields according to the depth characteristics;
and integrating the crowd densities of different image areas in the at least two first density maps by taking the matching degree corresponding to the at least two receptive fields as weight to obtain the number of people contained in the image to be processed.
In one possible implementation, a pooling layer is included between two adjacent first convolution layers of the at least two first convolution layers;
The extracting features of the image to be processed to obtain the depth features of the image to be processed includes:
after the at least one first convolution layer performs convolution processing, and before the extracted intermediate depth features are input into the next first convolution layer, performing pooling processing on the intermediate depth features based on the pooling layer to obtain the intermediate depth features to be input into the next first convolution layer.
In one possible implementation manner, the classifying the intermediate features of the different image areas to obtain probability distribution of the different image areas in the image to be processed includes:
and carrying out normalization processing on the intermediate features of the different image areas to obtain probability distribution of the different image areas.
In one aspect, there is provided an image processing apparatus including:
The image acquisition module is used for acquiring an image to be processed;
the feature extraction module is used for extracting features of the image to be processed to obtain depth features of the image to be processed;
The density acquisition module is used for acquiring crowd densities of different image areas in the image to be processed based on at least two receptive fields and the depth characteristics, and obtaining at least two first density maps corresponding to the at least two receptive fields, wherein the at least two receptive fields are different from each other;
The matching degree acquisition module is used for acquiring the matching degree between different image areas in the image to be processed and the at least two receptive fields according to the depth characteristics;
And the quantity acquisition module is used for integrating the crowd densities of different image areas in the at least two first density maps by taking the matching degree corresponding to the at least two receptive fields as weight to obtain the quantity of people contained in the image to be processed.
In one possible implementation manner, the density acquisition module is configured to perform convolution processing on the depth features based on at least two of convolution layers with different void ratios (dilation rates), a deformable convolution layer, an inception structure, or a residual structure, to obtain at least two first density maps corresponding to at least two receptive fields.
In one possible implementation manner, the feature extraction module is configured to convolve the image to be processed based on at least one continuous convolution layer, and take the output of the last convolution layer as the depth feature.
In one possible implementation, the number of the at least one convolution layer is at least two; the at least one convolution layer adopts skip links; two skip-linked convolution layers comprise a first convolution layer and a second convolution layer, wherein the first convolution layer is used for downsampling the depth features output by the previous first convolution layer, and the second convolution layer is used for upsampling the depth features output by the previous second convolution layer and the depth features output by the skip-linked first convolution layer.
In one possible implementation manner, the matching degree acquisition module comprises a feature acquisition unit and a normalization unit;
the feature acquisition unit is used for acquiring intermediate features of different image areas in the image to be processed according to the depth features;
The normalization unit is used for performing normalization processing on the intermediate features of the different image areas to obtain the matching degree between the different image areas and the at least two receptive fields in the image to be processed.
In one possible implementation manner, the normalization unit is configured to normalize intermediate features of the different image areas to obtain probability distributions of the different image areas in the image to be processed, where the probability distribution of one image area is used to represent a degree of matching between the image area and the at least two receptive fields, and one probability value in the probability distribution is used to represent a degree of matching between the image area and the target receptive field.
In one possible implementation manner, the feature acquisition unit is configured to:
Carrying out average pooling treatment on the depth features to obtain the depth features of the different image areas;
and respectively carrying out convolution processing on the depth features of the different image areas to obtain intermediate features of the different image areas.
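For illustration only, and not as a limitation of the above apparatus, the following is a minimal PyTorch-style sketch of one way such a matching degree acquisition could be arranged; the region size, the use of a 1x1 convolution to produce the intermediate features, and softmax normalization over the receptive-field dimension are assumptions rather than details fixed by the application:

```python
import torch.nn as nn
import torch.nn.functional as F

class MatchingDegreeHead(nn.Module):
    """Per-region matching degrees between image areas and K receptive fields."""
    def __init__(self, in_channels, num_fields, region_size=8):  # region_size is an assumption
        super().__init__()
        self.pool = nn.AvgPool2d(region_size)              # average pooling over each image area
        self.conv = nn.Conv2d(in_channels, num_fields, 1)  # intermediate features, one channel per field

    def forward(self, depth_features):
        intermediate = self.conv(self.pool(depth_features))
        # Normalization turns the intermediate features of each image area into a
        # probability distribution over the K receptive fields (the matching degrees).
        return F.softmax(intermediate, dim=1)
```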
In one possible implementation manner, the quantity acquisition module includes a weighting unit and a quantity acquisition unit;
The weighting unit is used for weighting crowd densities of different image areas in the at least two first density images by taking matching degrees corresponding to the at least two receptive fields as weights to obtain a second density image of the image to be processed;
the number acquisition unit is used for acquiring the number of people contained in the image to be processed according to the second density map.
In one possible implementation manner, the number obtaining unit is configured to sum the density values in the second density map to obtain the number of people contained in the image to be processed.
In one possible implementation, the steps of feature extraction, crowd density acquisition, matching degree acquisition, and number of people acquisition are performed by an image processing model;
The image processing model comprises a feature extraction network, at least two branch networks and a synthesis module; the feature extraction network is used for extracting features of the image to be processed to obtain depth features of the image to be processed; the at least two branch networks are used for executing crowd density acquisition steps based on at least two different receptive fields; the comprehensive module is used for executing the matching degree acquisition step and the number of people acquisition step.
In one possible implementation manner, the image processing model is acquired based on a sample image and a target density map corresponding to the sample image; the target density map is obtained based on the following process:
acquiring a sample image and the positions of at least two human heads in the sample image;
Generating at least two first response graphs according to the positions of the at least two heads, wherein the pixel values of the positions of the at least two heads in the first response graphs are 1, and the pixel values of the other positions are 0;
summing the at least two first response maps to obtain a second response map;
And carrying out Gaussian convolution processing on the second response graph to obtain the target density graph.
In one possible implementation, the training process of the image processing model includes:
Acquiring sample images, wherein each sample image corresponds to a target density map;
Inputting a sample image into an image processing model, and performing feature extraction on the sample image by the image processing model to obtain sample depth features of the sample image; obtaining crowd densities of different image areas in the sample image based on at least two different receptive fields and the sample depth features, to obtain at least two sample first density maps corresponding to the at least two receptive fields; acquiring the matching degree between different image areas in the sample image and the at least two receptive fields according to the sample depth features; and weighting crowd densities of different image areas in the at least two sample first density maps by taking the matching degrees corresponding to the at least two receptive fields as weights to obtain a sample second density map;
based on the target density map, the sample first density map and the sample second density map of the sample image, respectively obtaining a predicted loss value and a comprehensive loss value corresponding to the at least two different receptive fields;
And updating the model parameters of the image processing model based on the predicted loss value and the comprehensive loss value until the model parameters meet target conditions, and stopping to obtain a trained image processing model.
In one aspect, an electronic device is provided that includes one or more processors and one or more memories having stored therein at least one piece of program code that is loaded and executed by the one or more processors to implement various alternative implementations of the above-described image processing methods.
In one aspect, a computer readable storage medium having stored therein at least one program code loaded and executed by a processor to implement various alternative implementations of the image processing method described above is provided.
In one aspect, a computer program product or computer program is provided, the computer program product or computer program comprising one or more program codes, the one or more program codes being stored in a computer readable storage medium. The one or more processors of the electronic device are capable of reading the one or more program codes from the computer readable storage medium, the one or more processors executing the one or more program codes so that the electronic device can perform the image processing method of any one of the possible embodiments described above.
According to the embodiments of the application, it is considered that, because different image areas in an image are at different distances, the sizes of people's heads in those areas may differ. For each image area in the image, after the image features are extracted, the crowd density is obtained under different receptive fields; the distance of each image area is then determined by analyzing the different image areas in the image, and a suitable receptive field is dynamically selected to adjust the crowd density of that image area. The crowd density obtained in this way better matches the distance of the image area, which improves the precision of crowd density acquisition and, in turn, the accuracy of the number of people.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic illustration of a receptive field provided by embodiments of the application;
FIG. 2 is a schematic diagram of an implementation environment of an image processing method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an image processing procedure in a monitored scene according to an embodiment of the present application;
FIG. 4 is a flowchart of an image processing method according to an embodiment of the present application;
FIG. 5 is a flowchart of an image processing method according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a feature extraction network provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of a hole convolution provided by an embodiment of the present disclosure;
FIG. 8 is a schematic diagram of a convolutional layer with void fraction according to an embodiment of the present application;
FIG. 9 is a schematic diagram of an inception structure provided in accordance with an embodiment of the present application;
FIG. 10 is a schematic diagram of a normal convolution with a convolution kernel size of 3x3 and a sampling pattern of a deformable convolution according to an embodiment of the present disclosure;
FIG. 11 is a schematic diagram of a residual structure provided by an embodiment of the present application;
FIG. 12 is a schematic diagram of a density estimation process based on depth features according to an embodiment of the present application;
FIG. 13 is a schematic diagram of a density estimation process based on depth features according to an embodiment of the present application;
FIG. 14 is a schematic diagram before and after image processing according to an embodiment of the present application;
Fig. 15 is a schematic structural view of an image processing apparatus according to an embodiment of the present application;
fig. 16 is a block diagram of a terminal according to an embodiment of the present application;
Fig. 17 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
The terms "first," "second," and the like in this disclosure are used for distinguishing between similar elements or items having substantially the same function and function, and it should be understood that there is no logical or chronological dependency between the terms "first," "second," and "n," and that there is no limitation on the amount and order of execution. It will be further understood that, although the following description uses the terms first, second, etc. to describe various elements, these elements should not be limited by the terms. These terms are only used to distinguish one element from another element. For example, a first image can be referred to as a second image, and similarly, a second image can be referred to as a first image, without departing from the scope of the various described examples. The first image and the second image can both be images, and in some cases, can be separate and distinct images.
The term "at least one" in the present application means one or more, and the term "plurality" in the present application means two or more, for example, a plurality of data packets means two or more data packets.
It is to be understood that the terminology used in the description of the various examples described herein is for the purpose of describing particular examples only and is not intended to be limiting. As used in the description of the various described examples and in the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It will also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. The term "and/or" describes an association relationship between associated objects, meaning that three relationships can exist; for example, A and/or B can represent: A exists alone, A and B exist together, or B exists alone. In the present application, the character "/" generally indicates that the associated objects before and after it are in an "or" relationship.
It should also be understood that, in the embodiments of the present application, the sequence number of each process does not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not limit the implementation of the embodiments of the present application.
It should also be understood that determining B from a does not mean determining B from a alone, but can also determine B from a and/or other information.
It will be further understood that the terms "comprises" and/or "comprising" (also referred to as "includes", "including"), when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "if" may be interpreted to mean "when" or "upon" or "in response to a determination" or "in response to detection". Similarly, depending on the context, the phrase "if [a stated condition or event] is detected" may be interpreted to mean "upon determination" or "in response to determination" or "upon detection of [the stated condition or event]" or "in response to detection of [the stated condition or event]".
The following description of the terms involved in the present application.
Receptive field: in the convolutional neural network, a receptive field (RECEPTIVE FIELD) is defined as the area size mapped on the input picture by the pixel points on the feature map (feature map) output by each layer of the convolutional neural network. In popular point interpretation, a receptive field is a region on the feature map corresponding to a point on the input image, as shown in FIG. 1. After the convolution process, a point on the subsequent feature map corresponds to an area on the previous feature map. If the receptive field is large, the area is larger, the semantic level of the features in the feature map is higher, and the feature map has global property; if the receptive field is small, the area is smaller, the semantic level of the features in the feature map is lower, and the feature map has locality.
Upsampling and downsampling: in the downsampling process, the features of a picture are extracted, and the key parts of the picture are actually extracted, so that the resolution of the picture is reduced, and the picture is reduced; in the up-sampling process, to restore the size of the picture and increase the resolution of the picture, some methods are needed, and any technique that can change the picture to high resolution can be called up-sampling.
Bilinear interpolation: also known as bilinear interpolation. Bilinear interpolation includes nearest neighbor interpolation, bilinear interpolation, bicubic interpolation, and the like. For linear interpolation, linear interpolation is an interpolation method for one-dimensional data, which performs numerical estimation from two data points adjacent to each other on the left and right sides of a point to be interpolated in a one-dimensional data sequence, and assigns their specific gravity according to distances to the two points. For bilinear interpolation, it can be understood as a two-step linear interpolation: the interpolation is carried out in the x direction, and the interpolation result in the x direction is used for interpolation in the y direction. Bilinear interpolation is one way of image scaling.
Artificial Intelligence (AI) is the theory, method, technique, and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level technologies and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
Computer Vision (CV) is a science that studies how to make machines "see"; more specifically, it refers to using cameras and computers instead of human eyes to perform machine vision tasks such as recognizing and measuring targets, and to further perform graphic processing so that the computer produces images more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies and attempts to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision technology typically includes image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and synchronous positioning and map construction, as well as common biometric recognition technologies such as face recognition and fingerprint recognition.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how a computer simulates or implements human learning behavior to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching learning.
With the research and advancement of artificial intelligence technology, artificial intelligence technology is being researched and applied in many fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, and smart customer service. It is believed that with the development of technology, artificial intelligence technology will be applied in more fields and play an increasingly important role.
The scheme provided by the embodiment of the application relates to technologies of image processing, image semantic understanding and the like in computer vision of artificial intelligence, and technologies of neural network learning and the like in machine learning, and is specifically described by the following embodiments.
The environment in which the present application is implemented is described below.
Fig. 2 is a schematic diagram of an implementation environment of an image processing method according to an embodiment of the present application. The implementation environment includes a terminal 101 or the implementation environment includes a terminal 101 and an image processing platform 102. The terminal 101 is connected to the image processing platform 102 via a wireless network or a wired network.
The terminal 101 can be at least one of a smartphone, a game console, a desktop computer, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, or a laptop portable computer. The terminal 101 installs and runs an application program that supports image processing, which can be, for example, a system application, a shopping application, an online video application, or a social application.
The terminal 101 can have an image capture function and an image processing function, and can, for example, process a captured image and execute a corresponding function according to the processing result. The terminal 101 can complete this work independently, or the image processing platform 102 can provide data services for it. The embodiments of the present application are not limited in this respect.
The image processing platform 102 includes at least one of a server, a plurality of servers, a cloud computing platform, and a virtualization center. The image processing platform 102 is used to provide background services for applications that support image processing. Optionally, the image processing platform 102 takes on primary processing work and the terminal 101 takes on secondary processing work; or the image processing platform 102 takes on secondary processing work, and the terminal 101 takes on primary processing work; or the image processing platform 102 or the terminal 101, respectively, can solely take on processing work. Or the image processing platform 102 and the terminal 101 adopt a distributed computing architecture to perform cooperative computing.
Optionally, the image processing platform 102 includes at least one server 1021 and a database 1022, where the database 1022 is used to store data, and in an embodiment of the present application, the database 1022 can store sample images to provide data services for the at least one server 1021.
The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, basic cloud computing services such as big data and artificial intelligence platforms. The terminal can be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc.
Those skilled in the art will appreciate that the number of terminals 101 and servers 1021 can be greater or fewer. For example, the number of the terminals 101 and the servers 1021 can be only one, or the number of the terminals 101 and the servers 1021 can be tens or hundreds, or more, and the number and the device type of the terminals or the servers are not limited in the embodiment of the present application.
The image processing method provided by the application can be applied to any people counting scene, for example, in a monitoring scene, the image shot by monitoring can be processed, and the number of people contained in the image can be determined. For example, as shown in fig. 3, a camera 301 may be provided in some large-scale sites, and the camera 301 monitors the sites in real time and sends the captured image or video 302 to an image processing platform 303. The image processing platform 303 can process the shot image or video 302, determine the number of people contained in any frame or each frame of the image or video, and monitor the number of people in the place.
Fig. 4 is a flowchart of an image processing method according to an embodiment of the present application, where the method is applied to an electronic device, and the electronic device is a terminal or a server, and referring to fig. 4, the method includes the following steps.
401. The electronic device acquires an image to be processed.
402. And the electronic equipment performs feature extraction on the image to be processed to obtain depth features of the image to be processed.
Features refer to the characteristics that distinguish one thing from other things. By performing feature extraction on the image to be processed, the depth features indicate the characteristics of the image to be processed, so that the image content can be analyzed and the crowd density determined.
403. The electronic equipment obtains crowd densities of different image areas in the image to be processed based on at least two receptive fields and the depth feature, and at least two first density maps corresponding to the at least two receptive fields are obtained, wherein the at least two receptive fields are different from each other.
After obtaining the depth features, the electronic device may perform crowd density estimation on the depth features under different receptive fields; for the same image area, the crowd densities obtained through different receptive fields may differ. The at least two first density maps thus contain crowd densities of the image areas obtained with different accuracies.
404. And the electronic equipment acquires the matching degree between different image areas in the image to be processed and the at least two receptive fields according to the depth characteristics.
The matching degree indicates how well a receptive field matches an image area: the crowd density of some image areas is determined more accurately with a small receptive field, while the crowd density of other image areas is determined more accurately with a large receptive field. The matching degree measures how accurately the crowd density of a given image area is determined under each of the receptive fields.
405. And the electronic equipment synthesizes the crowd densities of different image areas in the at least two first density maps by taking the matching degree corresponding to the at least two receptive fields as a weight to obtain the number of people contained in the image to be processed.
When the first density maps obtained based on different receptive fields are obtained, because different image areas are at different distances, the receptive fields to which they are adapted may also differ. The crowd densities in the at least two first density maps are integrated according to the matching degrees, so that for each image area the crowd density obtained under a suitable receptive field can be dynamically selected according to the image content, and the number of people is therefore determined with higher precision and accuracy.
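Purely as an illustration of step 405, a minimal sketch of the weighted integration and counting is given below, assuming the matching degrees have already been upsampled to the same spatial resolution as the first density maps; the tensor layout is an assumption:

```python
import torch

def fuse_and_count(first_density_maps, matching_degrees):
    """first_density_maps: (B, K, H, W) crowd densities from K receptive fields.
    matching_degrees:      (B, K, H, W) per-area weights that sum to 1 over K."""
    # Weight each first density map by its matching degree and sum over receptive fields.
    second_density_map = (matching_degrees * first_density_maps).sum(dim=1)
    # Integrating (summing) the density map yields the number of people in the image.
    people_count = second_density_map.sum(dim=(1, 2))
    return second_density_map, people_count
```

In this sketch the per-pixel sum of the fused map plays the role of the integration over the density map mentioned in the description.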
According to the embodiments of the application, it is considered that, because different image areas in an image are at different distances, the sizes of people's heads in those areas may differ. For each image area in the image, after the image features are extracted, the crowd density is obtained under different receptive fields; the distance of each image area is then determined by analyzing the different image areas in the image, and a suitable receptive field is dynamically selected to adjust the crowd density of that image area. The crowd density obtained in this way better matches the distance of the image area, which improves the precision of crowd density acquisition and, in turn, the accuracy of the number of people.
Fig. 5 is a flowchart of an image processing method according to an embodiment of the present application, referring to fig. 5, the method includes the following steps.
501. The electronic device acquires an image to be processed.
In the embodiment of the application, the electronic equipment has an image processing function and can process the image to be processed to determine the number of people contained in the image.
In some embodiments, the image to be processed may include one or more persons, and the electronic device may process the image to be processed to determine the number of persons included in the image to be processed. In other embodiments, the image to be processed may not include people, and the electronic device may determine that the number of people is zero after processing the image to be processed.
The electronic device may acquire the image to be processed in a variety of ways. The electronic device may be a terminal, or a server.
In some embodiments, the electronic device is a terminal. In one possible implementation of this embodiment, the terminal has an image acquisition function, and the terminal can acquire an image as the image to be processed. In another possible implementation of this embodiment, the terminal is able to download an image from the target website as the image to be processed. In another possible implementation of this embodiment, the terminal may extract an image from the image database as the image to be processed. In another possible implementation manner of this embodiment, the terminal may acquire the imported image as the image to be processed in response to an image importing operation.
In other embodiments, the electronic device may be a server. In one possible implementation of this embodiment, the server is capable of receiving images acquired and transmitted by the terminal. In another possible implementation of this embodiment, the server may also be downloaded from the target website. In another possible implementation of this embodiment, the server may also extract an image from an image database as the image to be processed.
The foregoing provides only a few possible ways of obtaining the image to be processed, and of course, the terminal or the server may also obtain the image to be processed in other ways.
502. The electronic device inputs the image to be processed into the image processing model.
After the electronic equipment acquires the image to be processed, the image to be processed needs to be processed. In this embodiment, the image processing method can be implemented by an image processing model, and the electronic device may call the image processing model, input the image to be processed into the image processing model, and execute the subsequent image processing step by the image processing model. In other embodiments, the electronic device may also directly perform the subsequent image processing steps, without being implemented based on the image processing model, which is not limited by the embodiment of the present application.
For the image processing model, in some embodiments, the image processing model may be trained in the electronic device. In other embodiments, the image processing model may be sent to the electronic device by the other electronic device after training on the other electronic device is completed, and the electronic device may invoke the image processing model during image processing. The embodiment of the application does not limit on which equipment the training process of the image processing model is performed.
The explanation is made below for this image processing model.
In one possible implementation, the image processing model includes a feature extraction network, at least two branch networks, and a synthesis module. In the image processing process, the feature extraction network is used for extracting features of the image to be processed to obtain depth features of the image to be processed, i.e., the following step 503; the at least two branch networks are configured to perform the crowd density acquisition step based on at least two different receptive fields, i.e., step 504; and the synthesis module is configured to perform the matching degree acquisition step and the number-of-people acquisition step, i.e., steps 505 to 507 described below.
The image processing model is obtained through training based on a sample image and a target density map corresponding to the sample image. The target density map is a real density map of the sample image, and a pixel value of each pixel point in the target density map is a crowd density at the pixel point. The target density map is used as a true value, and the image processing model is trained, so that the trained image processing model can process a sample image to obtain the target density map, or an output result is very close to the target density map, and therefore the image processing capability of the image processing model is improved, and the image can be accurately processed.
For the target density map, the target density map can be determined according to the position of the head in the sample image. In some embodiments, the target density map is obtained based on the following steps one to four.
Step one, an electronic device acquires a sample image and positions of at least two human heads in the sample image.
In the first step, the positions of all the heads in the sample image can be determined, and the positions of all the heads are used as the basis for determining the target density map.
In some embodiments, the center point of the head may be used as the position of the head. In this embodiment, if the center point of a head is in the sample image, the head is counted as one of the persons contained in the sample image; if the center point of the head is not in the sample image, the head is not counted. For example, if the sample image contains only a small portion of a head and the center of that head is not in the sample image, that head is not counted when calculating the number of people contained in the sample image.
The positions of the at least two heads can be marked by related technicians, can be obtained according to target detection or can be obtained by combining a manual marking method on the basis of target detection, and the embodiment of the application is not limited to the positions.
Step two, the electronic equipment generates at least two first response graphs according to the positions of the at least two heads, wherein the pixel value of the positions of the at least two heads in the first response graphs is 1, and the pixel values of other positions are 0.
In the second step, the first response map reflects whether each pixel point contains the center point of a head: if so, the pixel value of the pixel point is 1; if not, the pixel value of the pixel point is 0. The description here uses the example in which "1" represents inclusion and "0" represents non-inclusion; of course, "0" could represent inclusion and "1" non-inclusion, and the embodiments of the present application are not limited in this respect. Whether each pixel point contains a head center point reflects the head density to some extent.
And thirdly, the electronic equipment sums the at least two first response graphs to obtain a second response graph.
In the third step, a corresponding first response map has been generated for the position of each head, and the electronic device may sum the plurality of first response maps, synthesizing them to obtain the second response map of the sample image.
And step four, the electronic equipment carries out Gaussian convolution processing on the second response graph to obtain the target density graph.
In the second response map, the crowd density is concentrated at the center points of the heads. The electronic device can perform Gaussian convolution processing on it, so that the crowd density is dispersed to the pixel points around each head center point, and the real crowd density at each pixel point of the image is thereby determined. For each head, if the contribution of the head to the density of the surrounding pixel points is assumed to decay according to a Gaussian function, Gaussian convolution processing can be adopted when processing the second response map, that is, the second response map is processed with a Gaussian convolution kernel.
For example, in one specific example, the learning goal of the model is a crowd density distribution thermodynamic diagram, referred to herein simply as a density map or thermodynamic diagram. The thermodynamic diagram reflects the average number of people at the position corresponding to each unit pixel in the actual scene (i.e., the sample image). To generate the crowd density map (i.e., the target density map) of the entire image, the N head center points x1, …, xN in the image are considered. For each head center xi, a two-dimensional response map Hi (i.e., a first response map) is generated, in which only the pixel value at the head center is 1 and the pixel values at the remaining positions are 0, where N and i are positive integers. The Hi corresponding to all head center points are then added to obtain a response map H (i.e., the second response map) of all heads in the original image; obviously, the integral of the response map H is the total number of people. Then, for each head, it is assumed that its contribution to the density of the surrounding pixel points decays according to a Gaussian function, so a normalized Gaussian kernel Gσ can be used to convolve the response map H to obtain the density map D (i.e., the target density map).
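A minimal sketch of this target density map construction is given below for illustration; the use of scipy's gaussian_filter and the value of the standard deviation sigma are assumptions rather than details fixed by the application:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def make_target_density_map(head_centers, height, width, sigma=4.0):
    """head_centers: list of (x, y) head center points x1, ..., xN in pixel coordinates."""
    # Second response map H: sum of the per-head first response maps (1 at each head center).
    H = np.zeros((height, width), dtype=np.float32)
    for x, y in head_centers:
        H[int(y), int(x)] += 1.0
    # Convolving H with a normalized Gaussian kernel spreads each unit of mass to the
    # surrounding pixels, so the density map D still integrates (approximately) to N.
    return gaussian_filter(H, sigma=sigma, mode='constant')
```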
The above is the process of obtaining the target density map. When the image processing model is used, the density map can be predicted by the image processing model, and the total number of people is then obtained by integration. The training process of the image processing model trains the model parameters of the image processing model, so that the density map predicted by the image processing model differs little from the target density map and is as close to it as possible.
In some embodiments, the training process of the image processing model may be implemented through steps one to four.
Step one, an electronic device acquires sample images, and each sample image corresponds to a target density map. The target density map is a true, correct density map.
Step two, the electronic device inputs the sample image into an image processing model, and the image processing model performs feature extraction on the sample image to obtain sample depth features of the sample image; obtains crowd densities of different image areas in the sample image based on at least two different receptive fields and the sample depth features, to obtain at least two sample first density maps corresponding to the at least two receptive fields; acquires the matching degree between different image areas in the sample image and the at least two receptive fields according to the sample depth features; and weights the crowd densities of different image areas in the at least two sample first density maps by taking the matching degrees corresponding to the at least two receptive fields as weights to obtain a sample second density map.
In the second step, the electronic device can input the sample image into the image processing model, and the image processing model executes a series of steps to determine a second density map of the sample for each sample image, where the process of determining the second density map of the sample is the same as steps 503 to 506 described below, which will be described in detail later.
In the process of determining the sample second density map, the sample image is analyzed according to the sample depth features, crowd density is estimated using different receptive fields to obtain the sample first density maps, and it is analyzed which receptive field is better suited to the distance of each image area in the sample image, so that the sample first density maps are integrated and a more accurate sample second density map is determined.
And thirdly, the electronic equipment respectively acquires a predicted loss value and a comprehensive loss value corresponding to the at least two different receptive fields based on the target density map, the first sample density map and the second sample density map of the sample image.
In the third step, the predicted loss values are determined based on the target density map and the sample first density maps; a predicted loss value is obtained for the prediction result of each receptive field, so that training with these values improves the prediction capability under the different receptive fields. The comprehensive loss value is determined based on the target density map and the sample second density map, and training with it improves the accuracy of the result obtained after the prediction results of the various receptive fields are synthesized.
In one possible implementation, the electronic device may obtain the predicted loss value and the composite loss value based on an MSE (Mean Square Error ) loss function.
Specifically, the electronic device may obtain the predicted loss value or the comprehensive loss value through the following formula one:

L_reg = (1/N) * Σ_{i=1}^{N} (z_i - z_i*)^2   (formula one)

where L_reg represents the loss value, N represents the total number of pixels in the training image, z_i* is the actual value of the density map at the i-th pixel (i.e., the density value in the target density map), and z_i is the predicted value of the network (i.e., the density value in the sample first density map or the sample second density map). By optimizing the network, the distribution thermodynamic diagram predicted by the network can finally be made as close as possible to the actual density map.
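A direct sketch of formula one, purely for illustration:

```python
import torch

def density_mse_loss(predicted, target):
    """L_reg = (1/N) * sum over all N pixels of (z_i - z_i*)^2."""
    return torch.mean((predicted - target) ** 2)
```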
And step four, the electronic equipment updates the model parameters of the image processing model based on the predicted loss value and the comprehensive loss value until the model parameters meet target conditions, and the trained image processing model is obtained.
The first to third steps are iterative processes, and after the model parameters are updated, the iterative processes can be continuously executed until the training target is reached. The target condition may be that the predicted loss value and the integrated loss value converge, or that the number of model iterations reaches a target number, which is not limited.
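For illustration, one possible training step combining the per-branch predicted loss values with the comprehensive loss value might look as follows; the model interface (returning the sample first density maps and the sample second density map) and the equal weighting of the loss terms are assumptions:

```python
import torch

def mse(pred, target):
    # Mean square error over all pixels of the density map.
    return torch.mean((pred - target) ** 2)

def train_step(model, optimizer, image, target_density):
    optimizer.zero_grad()
    # Assumed interface: the model returns the per-receptive-field sample first
    # density maps and the fused sample second density map.
    first_density_maps, second_density_map = model(image)
    loss = mse(second_density_map, target_density)           # comprehensive loss value
    for branch_map in first_density_maps:
        loss = loss + mse(branch_map, target_density)        # predicted loss values
    loss.backward()
    optimizer.step()
    return loss.item()
```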
503. And the electronic equipment performs feature extraction on the image to be processed based on a feature extraction network of the image processing model to obtain depth features of the image to be processed.
After the electronic equipment inputs the image to be processed into the image processing model, the image processing model can firstly extract the depth characteristics of the image to be processed, the depth characteristics are used for representing the information of each image area in the image to be processed, and then regression processing can be carried out on the depth characteristics to determine crowd density.
In some embodiments, the feature extraction network is a backbone network of the image processing model, and after the backbone network extracts the depth features, the depth features can be input into each branch network, and the branch networks process the depth features respectively.
In one possible implementation, the feature extraction network includes at least one continuous convolution layer. The electronic device may convolve the image to be processed based on the at least one continuous convolution layer, taking the output of the last convolution layer as the depth features. "Continuous" here means that the convolution layers are adjacent to one another.
In some embodiments, the output of the previous convolution layer serves as the input of the subsequent convolution layer. In other embodiments, the output of the previous convolution layer may be further processed and then used as the input of the next convolution layer. After convolution based on the at least one continuous convolution layer, the depth features can accurately represent the information contained in the image to be processed.
Wherein the number of the at least one convolution layer is one or more (i.e., at least two). In some embodiments, the number of the at least one convolution layer is at least two, i.e., a plurality.
In some embodiments, the at least one convolution layer may employ a skip link, wherein two convolution layers of the skip link include a first convolution layer and a second convolution layer, the first convolution layer configured to downsample a depth feature output by a previous first convolution layer; the second convolution layer is configured to upsample the depth feature of the previous second convolution layer output and the depth feature of the connected first convolution layer output.
Through skip links, the depth features output by earlier convolution layers can be combined with the depth features output by later convolution layers and used together as the input of a given convolution layer, so that this input includes both the context features carrying high-level semantic information, obtained through the step-by-step convolution of multiple convolution layers, and the local detail information. It can be appreciated that with skip links, detail information can be introduced during upsampling, making the extracted depth features more complete and accurate.
In one possible implementation, a pooling layer is included between two adjacent first convolution layers of the at least two first convolution layers, and spatial downsampling is achieved through the pooling layer. Thus, during feature extraction, after a first convolution layer performs convolution processing and before the extracted intermediate depth feature is input into the next first convolution layer, the electronic device further performs pooling processing on the intermediate depth feature based on the pooling layer to obtain the intermediate depth feature that is input into the next first convolution layer.
For example, in one specific example, the feature extraction network may employ a VGG16 network with a U-shaped structure that first downsamples and then upsamples. As shown in fig. 6, the left part of fig. 6 is the VGG16 front-end network used for downsampling, in which the first convolution layers are organized as ConvBlocks (Convolution Blocks) and the second convolution layers are ordinary convolution layers. Each ConvBlock consists of several successive convolution layers (two or three per block), all convolutions within the same ConvBlock have the same number of channels, and the channel counts of the ConvBlocks increase block by block (64, 128, 256, 512). The right part of fig. 6 is the upsampling part. Each upsampling convolution layer integrates the output of the previous convolution layer with the output of the skip-linked ConvBlock, and the integrated result serves as the input of that convolution layer; the integration may be element-by-element addition. The upsampling may be bilinear upsampling, in which linear interpolation is performed along the X and Y directions respectively to fill in the downsampled depth features and obtain the final depth features. Between ConvBlocks, spatial downsampling is achieved through maximum pooling (Maxpool), which enlarges the network receptive field and provides local translation invariance.
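By way of illustration only, the following PyTorch sketch shows one possible form of such a U-shaped backbone with skip links; the block depths, channel counts, and module names are assumptions for this example and do not reproduce the exact VGG16 configuration of fig. 6.

```python
import torch.nn as nn
import torch.nn.functional as F

def conv_block(in_ch, out_ch, n_convs):
    # Several successive 3x3 convolutions sharing the same channel count.
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class UShapedBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        # Downsampling path: VGG-style ConvBlocks separated by max pooling.
        self.down1 = conv_block(3, 64, 2)
        self.down2 = conv_block(64, 128, 2)
        self.down3 = conv_block(128, 256, 3)
        self.pool = nn.MaxPool2d(2)
        # Upsampling path: bilinear upsampling plus element-wise addition with
        # the skip-linked ConvBlock output, followed by a convolution.
        self.lat2 = nn.Conv2d(128, 256, 1)   # match channel counts for addition
        self.lat1 = nn.Conv2d(64, 256, 1)
        self.up2 = conv_block(256, 256, 1)
        self.up1 = conv_block(256, 256, 1)

    def forward(self, x):
        f1 = self.down1(x)                 # high resolution, local detail
        f2 = self.down2(self.pool(f1))
        f3 = self.down3(self.pool(f2))     # low resolution, high-level semantics
        u2 = F.interpolate(f3, size=f2.shape[-2:], mode="bilinear", align_corners=False)
        u2 = self.up2(u2 + self.lat2(f2))  # skip link reintroduces encoder detail
        u1 = F.interpolate(u2, size=f1.shape[-2:], mode="bilinear", align_corners=False)
        depth_features = self.up1(u1 + self.lat1(f1))
        return depth_features
```

The extracted depth features would then be fed to each branch network and to the comprehensive module described in the following steps.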
504. The electronic device inputs the depth features of the image to be processed into at least two branch networks, and the at least two branch networks obtain the crowd densities of different image areas in the image to be processed based on at least two receptive fields and the depth features, thereby obtaining at least two first density maps corresponding to the at least two receptive fields, wherein the at least two receptive fields are different from each other.
Each branch network is used to process the depth features based on one receptive field or one receptive-field range, so as to obtain the crowd density of different image areas and a first density map for that receptive field or receptive-field range; first density maps corresponding to different receptive fields are thus obtained through the plurality of branch networks. Because the distance conditions of different image areas in an image may differ, the sizes of the heads in those areas may also differ; compared with the related-art approach of estimating crowd density with a single fixed receptive field, performing the estimation with different receptive fields improves the accuracy of crowd density estimation.
For the at least two branch networks, the at least two different receptive fields can be implemented by different network structures. For example, the at least two branch networks may include convolution layers with different dilation rates; one of the at least two branch networks may be a deformable convolution layer; one of the at least two branch networks may be an inception structure; and one of the at least two branch networks may be a residual structure. The embodiment of the application can combine these branch-network structures arbitrarily; it is only necessary to ensure that the at least two branch networks differ sufficiently, so that each branch can adaptively learn an expression capability suited to a different scale. The number of branch networks and the structure of each branch network are not limited in the embodiment of the present application.
In step 504, the electronic device may perform convolution processing on the depth features based on at least two of: convolution layers with different dilation rates, deformable convolution layers, inception structures, or residual structures, to obtain the at least two first density maps corresponding to the at least two receptive fields.
In one specific example, each branch network may be referred to as a prediction head, and each prediction head is used to predict a first density map for one receptive field or receptive-field range. Each prediction head can adopt any one of the structures above, and first density maps corresponding to different receptive fields are obtained through the different convolution processes.
The dilated convolution layer, deformable convolution layer, inception structure, and residual structure mentioned above are explained below.
A dilated convolution layer performs dilated (atrous) convolution on the input data. Dilated convolution injects holes into a standard convolution kernel so as to enlarge the receptive field. Compared with an ordinary convolution operation, dilated convolution has one extra hyper-parameter, namely the dilation rate, which refers to the spacing between the sampled elements of the kernel. For example, taking a 3x3 convolution kernel as shown in fig. 7, zeros are inserted between the kernel elements so that the input is sampled at intervals; different interval sizes correspond to different dilation rates.
As shown in fig. 8, one or more of the at least two prediction heads may adopt the dilated convolution structure; that is, one or some branch networks consist of three 3x3 convolution layers with dilation rate d, where d is a positive number.
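A minimal sketch of such a dilated-convolution prediction head is shown below; the channel widths and the final 1x1 convolution that regresses the single-channel first density map are illustrative assumptions.

```python
import torch.nn as nn

def dilated_head(in_ch, d):
    # Three 3x3 convolutions with dilation rate d; padding=d keeps the spatial
    # size unchanged while enlarging the receptive field. A final 1x1 convolution
    # regresses a one-channel density map for this receptive field.
    return nn.Sequential(
        nn.Conv2d(in_ch, 256, 3, padding=d, dilation=d), nn.ReLU(inplace=True),
        nn.Conv2d(256, 256, 3, padding=d, dilation=d), nn.ReLU(inplace=True),
        nn.Conv2d(256, 256, 3, padding=d, dilation=d), nn.ReLU(inplace=True),
        nn.Conv2d(256, 1, 1),
    )
```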
An inception structure includes multiple branches. Each branch (which can be regarded as a set of filters) adopts different convolution layers and performs convolution processing at a different scale on the output of the previous layer; finally, the results of the multiple branches (that is, of the filters) are integrated to obtain a first density map whose receptive field lies within a certain range.
For example, fig. 9 shows an inception structure. A branch network may adopt the inception structure shown in fig. 9, which serves two purposes: first, 1x1 convolutions are used to raise or reduce the feature dimensionality; second, convolution and re-aggregation are performed at multiple kernel sizes simultaneously. In this way, the inception structure can process the depth features to obtain a crowd density estimation result whose receptive field lies within a certain range.
The inception structure is further explained as follows.
The multiple 1x1 convolution layers in fig. 9 stack additional convolutions within a receptive field of the same size, which allows richer features to be extracted; the three 1x1 convolutions in fig. 9 all play this role. A 1x1 convolution layer can also reduce dimensionality and thereby the computational complexity: the 1x1 convolutions placed before the middle 3x3 convolution and before the 5x5 convolution in fig. 9 both serve this purpose. When a convolution layer receives a large number of input features, convolving them directly incurs a huge amount of computation; if the input dimensionality is first reduced, the amount of convolution computation drops significantly.
In fig. 9, the input depth features are processed by four branches: filters of different sizes convolve or pool the input, and the branch outputs are finally concatenated along the feature dimension. Intuitively, convolution is thus performed at multiple scales simultaneously, so features of different scales can be extracted; richer features in turn lead to more accurate final regression decisions. In addition, decomposing a sparse matrix into dense matrices for computation speeds up convergence. One branch in the inception module uses maximum pooling (max pooling), which likewise serves to extract features.
Fig. 9 shows only one form of inception structure; the inception structure may take other forms, and practitioners may adjust the convolution layers within it as desired. The embodiment of the present application is not limited in this respect.
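A minimal sketch of an inception-style prediction head in the spirit of fig. 9 follows, assuming four parallel branches (1x1; 1x1 then 3x3; 1x1 then 5x5; max pooling then 1x1) concatenated along the channel dimension; the branch widths and the final 1x1 regression layer are illustrative assumptions.

```python
import torch
import torch.nn as nn

class InceptionHead(nn.Module):
    # Four parallel branches over the same input, concatenated along the channel
    # dimension, then regressed to a one-channel density map.
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 64, 1)
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, 64, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(64, 96, 3, padding=1))
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 32, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(32, 48, 5, padding=2))
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, 32, 1))
        self.out = nn.Conv2d(64 + 96 + 48 + 32, 1, 1)  # regress the density map

    def forward(self, x):
        y = torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)
        return self.out(y)
```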
A deformable convolution layer is obtained by adding offsets to an ordinary convolution layer, the offsets themselves being learned as part of the network. The size and position of the deformable convolution kernel can be adjusted dynamically according to the image content currently being recognized or classified; intuitively, the sampling positions of the convolution kernel change adaptively with the image content at different locations, so that the receptive field adapts itself to geometric variations such as the shape and size of different objects.
The sampling patterns of an ordinary 3x3 convolution and of a deformable convolution are compared in fig. 10: (a) shows an ordinary convolution layer, while (b) and (c) show deformable convolution layers. The ordinary convolution samples 9 points on a regular grid, whereas (b) and (c) add a displacement (indicated by an arrow, also referred to as an offset) to the regular sampling coordinates.
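A sketch of a deformable-convolution prediction head is given below, using torchvision's DeformConv2d as one possible realization; predicting the offsets with an ordinary convolution and ending with a 1x1 regression layer are assumptions of this example.

```python
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableHead(nn.Module):
    # The offsets are themselves predicted by a small convolution, so the
    # sampling locations of the 3x3 kernel adapt to the image content.
    def __init__(self, in_ch):
        super().__init__()
        self.offset = nn.Conv2d(in_ch, 2 * 3 * 3, 3, padding=1)  # (dx, dy) per kernel point
        self.deform = DeformConv2d(in_ch, 256, 3, padding=1)
        self.out = nn.Conv2d(256, 1, 1)

    def forward(self, x):
        return self.out(self.deform(x, self.offset(x)))
```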
With respect to the residual structure: in mathematical statistics, a residual is the difference between an actual observed value and an estimated (fitted) value. A residual structure obtains, for a given variable, both an actual observed value and an estimated value through a skip connection (identity mapping), and thereby obtains a residual. For example, if the optimization objective of a network is H(x) = F(x) + x, the residual structure converts the objective from fitting H(x) to fitting H(x) - x; the stacked layers then no longer need to learn an identity mapping but only a mapping close to zero, which effectively reduces the training difficulty. As shown in fig. 11, a branch network may adopt a residual structure, and the residual introduced in this way allows a crowd density map to be obtained for a particular receptive-field range.
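A sketch of a residual prediction head in the spirit of fig. 11 follows, where the stacked convolutions learn F(x) and the identity shortcut adds x back; the channel count and the final 1x1 regression layer are assumptions.

```python
import torch.nn as nn

class ResidualHead(nn.Module):
    # A residual block: the stacked convolutions only learn the residual F(x),
    # and the identity shortcut adds x back, easing optimization.
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1),
        )
        self.relu = nn.ReLU(inplace=True)
        self.out = nn.Conv2d(ch, 1, 1)

    def forward(self, x):
        return self.out(self.relu(self.body(x) + x))  # H(x) = F(x) + x
```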
505. Based on the comprehensive module of the image processing model, the electronic device obtains, according to the depth features, the matching degree between different image areas in the image to be processed and the at least two receptive fields.
After the electronic device obtains the first density maps corresponding to the different receptive fields, it can use the comprehensive module to analyze the image content and dynamically integrate the first density maps based on the analysis result. During integration, the image content is analyzed through the depth features, and the matching degree between the near-far condition of each (possibly small) image area and the different receptive fields is determined. Specifically, step 505 may be implemented through the following step one and step two.
Step one, the electronic device obtains intermediate features of different image areas in the image to be processed according to the depth features.
The electronic device may process the depth features so as to convert them into intermediate features whose number corresponds to the number of receptive fields. Specifically, the electronic device may perform average pooling on the depth features to obtain the depth features of the different image areas, and then perform convolution processing on the depth features of the different image areas to obtain the intermediate features of the different image areas.
For example, as shown in fig. 12, the electronic device may divide the feature map into k×k grid cells using an adaptive average pooling module and then apply a 1×1 convolution layer to transform the features of each cell, so that each cell predicts n values corresponding to the n prediction heads.
Step two, the electronic device performs normalization processing on the intermediate features of the different image areas to obtain the matching degree between the different image areas in the image to be processed and the at least two receptive fields.
In some embodiments, the matching degree may take the form of a probability distribution. Specifically, the electronic device may normalize the intermediate features of the different image areas to obtain a probability distribution for each image area in the image to be processed, where the probability distribution of an image area represents the matching degree between that image area and the at least two receptive fields, and a single probability value in the distribution represents the matching degree between the image area and one target receptive field.
For example, as shown in fig. 12, after obtaining the intermediate features, the electronic device may apply softmax to obtain a probability distribution over the n prediction heads. As shown in fig. 13, each branch serves as a prediction head: after the depth features are extracted by the feature extraction network, the n prediction heads each process the depth features to obtain a corresponding first density map.
The above is merely an exemplary illustration; the electronic device may also employ more complex or more effective structures, for example by inserting a multi-branch context modeling module, such as an inception structure, between the average pooling and the 1x1 convolution. The embodiment of the present application is not limited in this respect.
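Steps one and two above can be sketched as follows, assuming an adaptive average pooling to a k×k grid, a 1×1 convolution predicting n values per cell, and a softmax over the prediction-head dimension; the class and parameter names are hypothetical.

```python
import torch.nn as nn
import torch.nn.functional as F

class MatchingModule(nn.Module):
    # Adaptive average pooling to a k x k grid, a 1x1 convolution that predicts
    # n values per grid cell (one per prediction head), and a softmax over the
    # head dimension as the per-region matching degree.
    def __init__(self, in_ch, n_heads, k):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(k)
        self.proj = nn.Conv2d(in_ch, n_heads, 1)

    def forward(self, depth_features):
        cells = self.pool(depth_features)   # (B, C, k, k) per-region depth features
        logits = self.proj(cells)           # (B, n_heads, k, k) intermediate features
        return F.softmax(logits, dim=1)     # matching degrees per image region
```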
506. Based on the comprehensive module of the image processing model, the electronic device weights the crowd densities of different image areas in the at least two first density maps, taking the matching degrees corresponding to the at least two receptive fields as weights, to obtain a second density map of the image to be processed.
In determining the matching degree, the electronic device has analyzed which receptive field is more suitable for each image area in the image to be processed. Persons in different image areas may be at different distances, so the sizes of their heads differ and the receptive fields suited to them differ as well. Using the matching degree as a weight, a more suitable receptive field can be determined for each image area, so that the pixel values of each image area in the weighted second density map are closer to the pixel values of the first density map obtained with that more suitable receptive field.
The second density map is thus obtained at the granularity of image areas. Compared with the related-art approach of estimating the crowd density of the whole image with a single fixed receptive field, the density estimation accuracy is greatly improved, the second density map is more accurate, and the number of people determined in the subsequent steps is therefore more accurate.
The foregoing description takes as an example the comprehensive module adaptively integrating the first density maps corresponding to the at least two receptive fields according to the matching degree to obtain the second density map. The second density map may also be obtained in other ways; for example, the electronic device may select, for each image area and according to the matching degree, the corresponding region of the first density map of one receptive field, and then combine the selected regions of all image areas to obtain the second density map.
507. Based on the comprehensive module of the image processing model, the electronic device obtains the number of people contained in the image to be processed according to the second density map.
After the electronic device obtains the second density map, the map accurately reflects the crowd density at each pixel of the image to be processed, i.e., the average number of people at each pixel; the number of people contained in the image to be processed, i.e., the total headcount, can therefore be obtained from the second density map.
In some embodiments, the electronic device sums the density values in the second density map to obtain the number of people contained in the image to be processed. Since each density value represents the crowd density at one pixel, i.e., the average number of people at that pixel, summing the values over all pixels yields the number of people contained in the image.
In other embodiments, the electronic device may carry out this summation by integration, i.e., by integrating the density values of the second density map along the abscissa or the ordinate.
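A sketch of the weighting in step 506 and the summation in step 507 follows, assuming the k×k matching degrees are upsampled to pixel resolution before the weighted fusion (nearest-neighbour upsampling is an illustrative choice).

```python
import torch.nn.functional as F

def fuse_and_count(first_density_maps, matching_degree):
    """first_density_maps: (B, n_heads, H, W), one density map per receptive field.
    matching_degree: (B, n_heads, k, k), per-region weights from the softmax."""
    B, n, H, W = first_density_maps.shape
    # Upsample the k x k matching degrees to pixel resolution so that every
    # pixel of an image region shares the weights of that region.
    weights = F.interpolate(matching_degree, size=(H, W), mode="nearest")
    # Weighted sum over the head dimension gives the second density map.
    second_density_map = (first_density_maps * weights).sum(dim=1, keepdim=True)
    # Each density value is the average number of people at that pixel, so
    # summing over the whole map yields the headcount of the image.
    headcount = second_density_map.sum(dim=(1, 2, 3))
    return second_density_map, headcount
```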
It should be noted that steps 506 and 507 describe, by way of example, synthesizing the crowd densities of different image areas in the at least two first density maps with the matching degrees corresponding to the at least two receptive fields as weights to obtain the number of people contained in the image to be processed, namely by weighting the first density maps with the matching degrees to obtain the second density map and then deriving the total headcount from it. In some embodiments this process may be implemented in other ways: for example, the electronic device may select, for each image area and according to the matching degree, the region of the first density map of one receptive field, combine the selected regions to obtain a second density map, and then derive the total headcount; or the electronic device may directly weight and sum the density values of the first density maps according to the matching degree to obtain the total headcount. The embodiment of the present application does not limit which specific manner is adopted.
In one possible implementation, after step 506, having obtained the second density map based on the image processing model, the electronic device may output and display the second density map, or display each image area of the image to be processed in a display style corresponding to the crowd density of that area in the second density map. The display style of each image area thus corresponds to the density values in the second density map.
For example, as shown in fig. 14, by applying the image processing method provided by the application, image (a) of fig. 14 can be processed to determine the total number of people in the image, for example 208, and the crowd density in the image can be displayed according to the determined second density map, as shown in image (b) of fig. 14. The headcount of 208 is shown in (b); the crowd density differs between image areas and may be shown in different colors, which (b) illustrates with different patterns in place of colors.
Steps 502 to 507 above are described using the image processing model as an example; in some embodiments, the image processing method may be implemented without an image processing model. In that case, after obtaining the image to be processed, the electronic device directly executes image processing steps similar to steps 502 to 507, only without the model, and the embodiment of the present application does not limit which manner is specifically adopted.
In the embodiment of the application, it is considered that when the distance conditions of different image areas in an image differ, the sizes of the heads may also differ. Therefore, after the image features are extracted, the crowd density of each image area is obtained under different receptive fields; the different image areas are then analyzed to determine how near or far each area is, and a suitable receptive field is dynamically selected to adjust the crowd density of that area. The resulting crowd density better matches the near-far condition of each image area, which improves the accuracy of crowd density estimation and, in turn, the accuracy of the headcount.
All the above optional solutions can be combined to form an optional embodiment of the present application, and will not be described in detail herein.
Fig. 15 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application, referring to fig. 15, the apparatus includes:
an image acquisition module 1501 for acquiring an image to be processed;
a feature extraction module 1502, configured to perform feature extraction on the image to be processed, so as to obtain depth features of the image to be processed;
The density obtaining module 1503 is configured to obtain population densities of different image areas in the image to be processed based on at least two receptive fields and the depth feature, so as to obtain at least two first density maps corresponding to the at least two receptive fields, where the at least two receptive fields are different from each other;
A matching degree obtaining module 1504, configured to obtain matching degrees between different image areas in the image to be processed and the at least two receptive fields according to the depth features;
the number obtaining module 1505 is configured to synthesize population densities of different image areas in the at least two first density maps with the matching degree corresponding to the at least two receptive fields as a weight, so as to obtain the number of people contained in the image to be processed.
In one possible implementation, the density obtaining module 1503 is configured to perform convolution processing on the depth features based on at least two of: convolution layers with different dilation rates, deformable convolution layers, inception structures, or residual structures, to obtain the at least two first density maps corresponding to the at least two receptive fields.
In one possible implementation, the feature extraction module 1502 is configured to convolve the image to be processed based on at least one convolution layer in succession, and take the output of the last convolution layer as the depth feature.
In one possible implementation, the number of the at least one convolution layer is at least two; the at least one convolution layer employs skip links; two convolution layers connected by a skip link include a first convolution layer and a second convolution layer, wherein the first convolution layer is used for downsampling the depth feature output by the preceding first convolution layer, and the second convolution layer is used for upsampling the depth feature output by the preceding second convolution layer together with the depth feature output by the skip-linked first convolution layer.
In one possible implementation, the matching degree obtaining module 1504 includes a feature obtaining unit and a normalizing unit;
the feature acquisition unit is used for acquiring intermediate features of different image areas in the image to be processed according to the depth features;
the normalization unit is used for performing normalization processing on the intermediate features of the different image areas to obtain the matching degree between the different image areas and the at least two receptive fields in the image to be processed.
In one possible implementation manner, the normalization unit is configured to normalize intermediate features of the different image areas to obtain probability distributions of the different image areas in the image to be processed, where the probability distribution of one image area is used to represent a degree of matching between the image area and the at least two receptive fields, and one probability value in the probability distribution is used to represent a degree of matching between the image area and the target receptive field.
In one possible implementation, the feature acquisition unit is configured to:
Carrying out average pooling treatment on the depth features to obtain the depth features of the different image areas;
And respectively carrying out convolution processing on the depth features of the different image areas to obtain intermediate features of the different image areas.
In one possible implementation, the number acquisition module 1505 includes a weighting unit and a number acquisition unit;
the weighting unit is used for weighting crowd densities of different image areas in the at least two first density images by taking the matching degree corresponding to the at least two receptive fields as a weight to obtain a second density image of the image to be processed;
The number acquisition unit is used for acquiring the number of people contained in the image to be processed according to the second density map.
In one possible implementation manner, the number acquisition unit is configured to sum the density values in the second density map to obtain the number of people contained in the image to be processed.
In one possible implementation, the feature extraction, crowd density acquisition, matching degree acquisition, and number of people acquisition steps are performed by an image processing model;
The image processing model comprises a feature extraction network, at least two branch networks, and a comprehensive module; the feature extraction network is used for performing feature extraction on the image to be processed to obtain depth features of the image to be processed; the at least two branch networks are used for executing the crowd density acquisition step based on at least two different receptive fields; the comprehensive module is used for executing the matching degree acquisition step and the number-of-people acquisition step.
In one possible implementation, the image processing model is trained based on a sample image and a target density map corresponding to the sample image; the target density map is obtained based on the following process (a sketch follows the list below):
acquiring a sample image and the positions of at least two human heads in the sample image;
generating at least two first response graphs according to the positions of the at least two heads, wherein the pixel values of the positions of the at least two heads in the first response graphs are 1, and the pixel values of the other positions are 0;
Summing the at least two first response maps to obtain a second response map;
and carrying out Gaussian convolution processing on the second response graph to obtain the target density graph.
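As indicated above, this process can be sketched in a few lines; the Gaussian kernel width sigma is an illustrative choice not fixed by this description, and summing the per-head first response maps is implemented here by accumulating the impulses into a single array.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def make_target_density_map(head_positions, height, width, sigma=4.0):
    """head_positions: iterable of (row, col) head-centre annotations."""
    second_response = np.zeros((height, width), dtype=np.float32)
    for r, c in head_positions:
        # First response map of one head: value 1 at the head position, 0 elsewhere.
        # Summing all first response maps is equivalent to accumulating here.
        second_response[int(r), int(c)] += 1.0
    # Gaussian convolution spreads each unit impulse into a small blob; up to
    # boundary effects, the integral of the map still equals the number of heads.
    return gaussian_filter(second_response, sigma=sigma)
```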
In one possible implementation, the training process of the image processing model includes:
Acquiring sample images, wherein each sample image corresponds to a target density map;
inputting the sample image into an image processing model, and performing feature extraction on the sample image by the image processing model to obtain sample depth features of the sample image; based on at least two different receptive fields and the sample depth features, obtaining crowd densities of different image areas in the sample image to obtain at least two sample first density maps corresponding to the at least two receptive fields; acquiring the matching degree between different image areas in the sample image and the at least two receptive fields according to the sample depth features; weighting the crowd densities of different image areas in the at least two sample first density maps with the matching degrees corresponding to the at least two receptive fields as weights to obtain a sample second density map;
based on the target density map, the sample first density maps and the sample second density map of the sample image, respectively obtaining the predicted loss value corresponding to the at least two different receptive fields and the comprehensive loss value (one possible form is sketched after this list);
And updating the model parameters of the image processing model based on the predicted loss value and the comprehensive loss value until the model parameters meet the target conditions, and stopping to obtain the trained image processing model.
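One possible form of the two loss terms is sketched below, assuming pixel-wise mean-squared error against the target density map; the loss form is an assumption of this example and is not fixed by the description above.

```python
import torch.nn.functional as F

def training_losses(sample_first_maps, sample_second_map, target_map):
    """sample_first_maps: (B, n_heads, H, W); sample_second_map, target_map: (B, 1, H, W)."""
    # One predicted loss value per receptive field: each branch's sample first
    # density map is compared against the same target density map.
    predicted_losses = [
        F.mse_loss(sample_first_maps[:, i:i + 1], target_map)
        for i in range(sample_first_maps.shape[1])
    ]
    # Comprehensive loss value: the fused sample second density map against the target.
    comprehensive_loss = F.mse_loss(sample_second_map, target_map)
    # The model parameters would then be updated on the sum of all loss terms.
    total_loss = sum(predicted_losses) + comprehensive_loss
    return predicted_losses, comprehensive_loss, total_loss
```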
With the apparatus provided by the embodiment of the application, it is considered that when the distance conditions of different image areas in an image differ, the sizes of the heads may also differ. Therefore, after the image features are extracted, the crowd density of each image area is obtained under different receptive fields; the different image areas are then analyzed to determine how near or far each area is, and a suitable receptive field is dynamically selected to adjust the crowd density of that area. The resulting crowd density better matches the near-far condition of each image area, which improves the accuracy of crowd density estimation and, in turn, the accuracy of the headcount.
It should be noted that: the image processing apparatus provided in the above embodiment is exemplified by the above-described division of the respective functional modules when processing an image, and in practical application, the above-described functional allocation can be performed by different functional modules as needed, that is, the internal structure of the image processing apparatus is divided into different functional modules to perform all or part of the functions described above. In addition, the image processing apparatus and the image processing method provided in the foregoing embodiments belong to the same concept, and specific implementation processes thereof are detailed in the method embodiments and are not described herein again.
The electronic device in the method embodiments described above can be implemented as a terminal. For example, fig. 16 is a block diagram of a terminal according to an embodiment of the present application. The terminal 1600 may be a portable mobile terminal, such as a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. The terminal 1600 may also be called a user device, a portable terminal, a laptop terminal, a desktop terminal, or other names.
In general, terminal 1600 includes: a processor 1601, and a memory 1603.
Processor 1601 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 1601 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), or PLA (Programmable Logic Array). The processor 1601 may also include a main processor and a coprocessor: the main processor is a processor for processing data in an awake state, also referred to as a CPU (Central Processing Unit); the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 1601 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 1601 may also include an AI (Artificial Intelligence) processor for handling computing operations related to machine learning.
Memory 1603 may include one or more computer-readable storage media, which may be non-transitory. Memory 1603 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 1603 is used to store at least one instruction for execution by processor 1601 to implement an image processing method provided by a method embodiment of the present application.
In some embodiments, terminal 1600 may also optionally include: a peripheral interface 1603, and at least one peripheral. The processor 1601, the memory 1603, and the peripheral interface 1603 may be connected by a bus or signal lines. The individual peripheral devices may be connected to the peripheral device interface 1603 by buses, signal lines, or circuit boards. Specifically, the peripheral device includes: at least one of radio frequency circuitry 1604, a display screen 1605, a camera assembly 1606, audio circuitry 1607, and a power supply 1609.
Peripheral interface 1603 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 1601 and the memory 1603. In some embodiments, the processor 1601, the memory 1603, and the peripheral interface 1603 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1601, the memory 1603, and the peripheral interface 1603 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The radio frequency circuit 1604 is used for receiving and transmitting RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuit 1604 communicates with a communication network and other communication devices via electromagnetic signals, converting electrical signals into electromagnetic signals for transmission or converting received electromagnetic signals into electrical signals. Optionally, the radio frequency circuit 1604 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 1604 may communicate with other terminals via at least one wireless communication protocol, including but not limited to: the World Wide Web, metropolitan area networks, intranets, mobile communication networks of various generations (3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 1604 may further include NFC (Near Field Communication) related circuits, which are not limited by the present application.
The display screen 1605 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display 1605 is a touch display, the display 1605 also has the ability to collect touch signals at or above its surface. The touch signal may be input to the processor 1601 as a control signal for processing; at this point, the display 1605 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, the display 1605 may be one display disposed on the front panel of the terminal 1600; in other embodiments, there may be at least two displays, each disposed on a different surface of the terminal 1600 or in a folded configuration; in still other embodiments, the display 1605 may be a flexible display disposed on a curved surface or a folded surface of the terminal 1600. The display screen 1605 may even be arranged in a non-rectangular, irregular pattern, i.e., an irregularly shaped screen. The display screen 1605 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), or other materials.
The camera assembly 1606 is used to capture images or video. Optionally, camera assembly 1606 includes a front camera and a rear camera. Typically, the front camera is disposed on the front panel of the terminal and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each being one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting, Virtual Reality (VR) shooting, or other fused shooting functions. In some embodiments, camera assembly 1606 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash refers to a combination of a warm-light flash and a cold-light flash and can be used for light compensation under different color temperatures.
Audio circuitry 1607 may include a microphone and a speaker. The microphone is used for collecting sound waves of users and environments, converting the sound waves into electric signals, and inputting the electric signals to the processor 1601 for processing, or inputting the electric signals to the radio frequency circuit 1604 for voice communication. The microphone may be provided in a plurality of different locations of the terminal 1600 for stereo acquisition or noise reduction purposes. The microphone may also be an array microphone or an omni-directional pickup microphone. The speaker is used to convert electrical signals from the processor 1601 or the radio frequency circuit 1604 into sound waves. The speaker may be a conventional thin film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, not only the electric signal can be converted into a sound wave audible to humans, but also the electric signal can be converted into a sound wave inaudible to humans for ranging and other purposes. In some embodiments, audio circuitry 1607 may also include a headphone jack.
A power supply 1609 is used to power the various components in the terminal 1600. The power supply 1609 may be an alternating current, a direct current, a disposable battery, or a rechargeable battery. When the power supply 1609 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. The wired rechargeable battery is a battery charged through a wired line, and the wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, terminal 1600 also includes one or more sensors 1610. The one or more sensors 1610 include, but are not limited to: an acceleration sensor 1611, a gyro sensor 1613, a pressure sensor 1613, an optical sensor 1615, and a proximity sensor 1616.
The acceleration sensor 1611 may detect the magnitudes of accelerations on three coordinate axes of a coordinate system established with the terminal 1600. For example, the acceleration sensor 1611 may be used to detect components of gravitational acceleration in three coordinate axes. The processor 1601 may control the display screen 1605 to display a user interface in a landscape view or a portrait view based on the gravitational acceleration signal acquired by the acceleration sensor 1611. The acceleration sensor 1611 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 1613 may detect a body direction and a rotation angle of the terminal 1600, and the gyro sensor 1613 may collect 3D motion of the user to the terminal 1600 in cooperation with the acceleration sensor 1611. The processor 1601 may implement the following functions based on the data collected by the gyro sensor 1613: motion sensing (e.g., changing UI according to a tilting operation by a user), image stabilization at shooting, game control, and inertial navigation.
Pressure sensor 1613 may be disposed on a side frame of terminal 1600 and/or on an underlying layer of display 1605. When the pressure sensor 1613 is disposed at a side frame of the terminal 1600, a grip signal of the terminal 1600 by a user may be detected, and the processor 1601 performs a left-right hand recognition or a quick operation according to the grip signal collected by the pressure sensor 1613. When the pressure sensor 1613 is disposed at the lower layer of the display screen 1605, the processor 1601 performs control on an operability control on the UI interface according to a pressure operation of the display screen 1605 by a user. The operability controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.
The optical sensor 1615 is used to collect ambient light intensity. In one embodiment, the processor 1601 may control the display brightness of the display screen 1605 based on the ambient light intensity collected by the optical sensor 1615. Specifically, when the intensity of the ambient light is high, the display brightness of the display screen 1605 is turned up; when the ambient light intensity is low, the display brightness of the display screen 1605 is turned down. In another embodiment, the processor 1601 may also dynamically adjust the capture parameters of the camera module 1606 based on the ambient light intensity collected by the optical sensor 1615.
A proximity sensor 1616, also referred to as a distance sensor, is typically provided on the front panel of the terminal 1600. The proximity sensor 1616 is used to collect a distance between a user and the front surface of the terminal 1600. In one embodiment, when the proximity sensor 1616 detects that the distance between the user and the front face of the terminal 1600 is gradually decreasing, the processor 1601 controls the display 1605 to switch from the bright screen state to the off screen state; when the proximity sensor 1616 detects that the distance between the user and the front surface of the terminal 1600 gradually increases, the processor 1601 controls the display 1605 to switch from the off-screen state to the on-screen state.
Those skilled in the art will appreciate that the structure shown in fig. 16 is not limiting and that more or fewer components than shown may be included or certain components may be combined or a different arrangement of components may be employed.
The electronic device in the above-described method embodiment can be implemented as a server. For example, fig. 17 is a schematic structural diagram of a server according to an embodiment of the present application, where the server 1700 may have a relatively large difference due to different configurations or performances, and may include one or more processors (Central Processing Units, CPU) 1701 and one or more memories 1703, where at least one program code is stored in the memories 1703, and the at least one program code is loaded and executed by the processors 1701 to implement the image processing methods provided in the above-mentioned method embodiments. Of course, the server can also have components such as a wired or wireless network interface and an input/output interface for inputting and outputting, and can also include other components for implementing the functions of the device, which are not described herein.
In an exemplary embodiment, a computer readable storage medium, for example a memory, comprising at least one program code executable by a processor to perform the image processing method of the above embodiment is also provided. For example, the computer readable storage medium can be Read-Only Memory (ROM), random access Memory (Random Access Memory, RAM), compact disk Read-Only Memory (Compact Disc Read-Only Memory, CD-ROM), magnetic tape, floppy disk, optical data storage device, and the like.
In an exemplary embodiment, a computer program product or a computer program is also provided, the computer program product or the computer program comprising one or more program codes, the one or more program codes being stored in a computer readable storage medium. The one or more processors of the electronic device are capable of reading the one or more program codes from the computer-readable storage medium, the one or more processors executing the one or more program codes so that the electronic device can perform the above-described image processing method.
It should be understood that, in various embodiments of the present application, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic thereof, and should not constitute any limitation on the implementation process of the embodiments of the present application.
It should be understood that determining B from A does not mean determining B from A alone; B can also be determined from A and/or other information.
Those of ordinary skill in the art will appreciate that all or a portion of the steps implementing the above-described embodiments can be implemented by hardware, or can be implemented by a program instructing the relevant hardware, and the program can be stored in a computer readable storage medium, and the above-mentioned storage medium can be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only of alternative embodiments of the application and is not intended to limit the application, but any modifications, equivalents, improvements, etc. which fall within the spirit and principles of the application are intended to be included in the scope of the application.

Claims (24)

1. An image processing method, the method comprising:
Acquiring an image to be processed;
Extracting features of the image to be processed to obtain depth features of the image to be processed;
Based on at least two receptive fields and the depth features, crowd densities of different image areas in the image to be processed are obtained, and at least two first density maps corresponding to the at least two receptive fields are obtained, wherein the at least two receptive fields are different from each other;
Acquiring intermediate features of different image areas in the image to be processed according to the depth features; normalizing the intermediate features of the different image areas to obtain probability distribution of the different image areas in the image to be processed, wherein the probability distribution of one image area is used for representing the matching degree between the image area and the at least two receptive fields, and one probability value in the probability distribution is used for representing the matching degree between the image area and the target receptive field;
and integrating the crowd densities of different image areas in the at least two first density maps by taking the matching degree corresponding to the at least two receptive fields as weight to obtain the number of people contained in the image to be processed.
2. The method according to claim 1, wherein the obtaining the crowd density of different image areas in the image to be processed based on the at least two receptive fields and the depth features, and obtaining at least two first density maps corresponding to the at least two receptive fields, includes:
performing convolution processing on the depth features based on at least two of: convolution layers with different dilation rates, deformable convolution layers, inception structures, or residual structures, to obtain the at least two first density maps corresponding to the at least two receptive fields.
3. The method according to claim 1, wherein the feature extraction of the image to be processed to obtain depth features of the image to be processed comprises:
And carrying out convolution processing on the image to be processed based on at least one continuous convolution layer, and taking the output of the last convolution layer as the depth characteristic.
4. The method according to claim 3, wherein the number of the at least one convolution layer is at least two; the at least one convolution layer employs skip links; two convolution layers connected by a skip link comprise a first convolution layer and a second convolution layer, wherein the first convolution layer is used for downsampling the depth feature output by the preceding first convolution layer, and the second convolution layer is used for upsampling the depth feature output by the preceding second convolution layer together with the depth feature output by the skip-linked first convolution layer.
5. The method according to claim 1, wherein the obtaining intermediate features of different image areas in the image to be processed according to the depth features comprises:
Carrying out average pooling treatment on the depth features to obtain the depth features of the different image areas;
and respectively carrying out convolution processing on the depth features of the different image areas to obtain intermediate features of the different image areas.
6. The method according to claim 1, wherein the step of integrating the population densities of different image areas in the at least two first density maps with the matching degree corresponding to the at least two receptive fields as a weight to obtain the number of people contained in the image to be processed includes:
Weighting crowd densities of different image areas in the at least two first density images by taking matching degrees corresponding to the at least two receptive fields as weights to obtain a second density image of the image to be processed;
and acquiring the number of people contained in the image to be processed according to the second density map.
7. The method of claim 6, wherein the acquiring the number of people contained in the image to be processed from the second density map includes:
and summing the density values in the second density map to obtain the number of people contained in the image to be processed.
8. The method of claim 1, wherein the steps of feature extraction, crowd density acquisition, matching degree acquisition, and number of people acquisition are performed by an image processing model;
The image processing model comprises a feature extraction network, at least two branch networks and a synthesis module; the feature extraction network is used for extracting features of the image to be processed to obtain depth features of the image to be processed; the at least two branch networks are used for executing crowd density acquisition steps based on at least two different receptive fields; the comprehensive module is used for executing the matching degree acquisition step and the number of people acquisition step.
9. The method of claim 8, wherein the image processing model is trained based on the sample image and a target density map corresponding to the sample image; the target density map is obtained based on the following process:
acquiring a sample image and the positions of at least two human heads in the sample image;
Generating at least two first response graphs according to the positions of the at least two heads, wherein the pixel values of the positions of the at least two heads in the first response graphs are 1, and the pixel values of the other positions are 0;
summing the at least two first response maps to obtain a second response map;
And carrying out Gaussian convolution processing on the second response graph to obtain the target density graph.
10. The method of claim 8, wherein the training process of the image processing model comprises:
Acquiring sample images, wherein each sample image corresponds to a target density map;
Inputting the sample image into an image processing model, and extracting features of the sample image by the image processing model to obtain sample depth features of the sample image; based on at least two different receptive fields and the sample depth features, crowd densities of different image areas in the sample image are obtained, and at least two sample first density maps corresponding to the at least two receptive fields are obtained; acquiring the matching degree between different image areas in the sample image and the at least two receptive fields according to the sample depth characteristics; weighting crowd densities of different image areas in the at least two sample first density maps by taking matching degrees corresponding to the at least two receptive fields as weights to obtain a sample second density map;
based on the target density map, the sample first density map and the sample second density map of the sample image, respectively obtaining a predicted loss value and a comprehensive loss value corresponding to the at least two different receptive fields;
And updating the model parameters of the image processing model based on the predicted loss value and the comprehensive loss value until the model parameters meet target conditions, and stopping to obtain a trained image processing model.
11. The method of claim 4, wherein a pooling layer is included between adjacent ones of the at least two first convolution layers; and the step of performing feature extraction on the image to be processed to obtain the depth features of the image to be processed further comprises:
after a first convolution layer performs convolution processing and before the extracted intermediate depth feature is input into the next first convolution layer, performing pooling processing on the intermediate depth feature based on the pooling layer to obtain the intermediate depth feature that is input into the next first convolution layer.
12. An image processing apparatus, characterized in that the apparatus comprises:
The image acquisition module is used for acquiring an image to be processed;
the feature extraction module is used for extracting features of the image to be processed to obtain depth features of the image to be processed;
The density acquisition module is used for acquiring crowd densities of different image areas in the image to be processed based on at least two receptive fields and the depth characteristics, and obtaining at least two first density maps corresponding to the at least two receptive fields, wherein the at least two receptive fields are different from each other;
The matching degree acquisition module comprises a characteristic acquisition unit and a normalization unit; the feature acquisition unit is used for acquiring intermediate features of different image areas in the image to be processed according to the depth features;
The normalization unit is used for performing normalization processing on the intermediate features of the different image areas to obtain probability distribution of the different image areas in the image to be processed, wherein the probability distribution of one image area is used for representing the matching degree between the image area and the at least two receptive fields, and one probability value in the probability distribution is used for representing the matching degree between the image area and the target receptive field;
And the quantity acquisition module is used for integrating the crowd densities of different image areas in the at least two first density maps by taking the matching degree corresponding to the at least two receptive fields as weight to obtain the quantity of people contained in the image to be processed.
13. The apparatus of claim 12, wherein the density acquisition module is configured to perform convolution processing on the depth features based on at least two of: convolution layers with different dilation rates, deformable convolution layers, inception structures, or residual structures, to obtain the at least two first density maps corresponding to the at least two receptive fields.
14. The apparatus of claim 12, wherein the feature extraction module is configured to successively perform convolution processing on the image to be processed based on at least one convolution layer, and to take the output of the last convolution layer as the depth features.
15. The apparatus of claim 14, wherein the number of the at least one convolution layer is at least two, and the convolution layers adopt skip connections; two skip-connected convolution layers comprise a first convolution layer and a second convolution layer, the first convolution layer being configured to downsample the depth features output by the previous first convolution layer, and the second convolution layer being configured to upsample the depth features output by the previous second convolution layer together with the depth features output by the skip-connected first convolution layer.
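A hedged sketch of the skip-connected extractor of claims 14 and 15, in the spirit of an encoder-decoder: the first (downsampling) convolution layers reduce resolution, and a second (upsampling) layer merges the skip-connected features. All layer sizes are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipFeatureExtractor(nn.Module):
    """Hypothetical extractor: first convolution layers downsample, the second
    convolution layer upsamples and merges the skip-connected features."""
    def __init__(self):
        super().__init__()
        self.down1 = nn.Conv2d(3, 64, 3, stride=2, padding=1)    # downsampling first layer
        self.down2 = nn.Conv2d(64, 128, 3, stride=2, padding=1)  # downsampling first layer
        self.up1 = nn.Conv2d(128 + 64, 64, 3, padding=1)         # second layer, consumes the skip

    def forward(self, x):
        d1 = F.relu(self.down1(x))
        d2 = F.relu(self.down2(d1))
        u = F.interpolate(d2, size=d1.shape[-2:], mode="bilinear", align_corners=False)
        return F.relu(self.up1(torch.cat([u, d1], dim=1)))        # merged depth features

features = SkipFeatureExtractor()(torch.rand(1, 3, 128, 128))
```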
16. The apparatus according to claim 12, wherein the feature acquisition unit is configured to:
perform average pooling processing on the depth features to obtain depth features of the different image areas; and
respectively perform convolution processing on the depth features of the different image areas to obtain the intermediate features of the different image areas.
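Claim 16's two steps (average pooling into image areas, then per-area convolution) could be sketched as follows; the pooling window, channel counts and the 1x1 convolution are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Hypothetical: average pooling aggregates the depth features into coarse image areas,
# then a 1x1 convolution maps each area's feature to K matching scores (one per branch).
K = 3
depth_features = torch.rand(1, 256, 64, 64)

area_pool = nn.AvgPool2d(kernel_size=8)            # each 8x8 block becomes one image area
score_conv = nn.Conv2d(256, K, kernel_size=1)

area_features = area_pool(depth_features)            # depth features of the different areas
intermediate_features = score_conv(area_features)    # intermediate features, shape (1, K, 8, 8)
```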
17. The apparatus of claim 12, wherein the quantity acquisition module comprises a weighting unit and a number acquisition unit;
the weighting unit is configured to weight the crowd densities of the different image areas in the at least two first density maps, with the matching degrees corresponding to the at least two receptive fields as weights, to obtain a second density map of the image to be processed; and
the number acquisition unit is configured to acquire the number of people contained in the image to be processed according to the second density map.
18. The apparatus of claim 17, wherein the number acquisition unit is configured to perform summation processing on the density values in the second density map to obtain the number of people contained in the image to be processed.
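As a worked illustration of claims 17 and 18: once the fused second density map is available, the head count is simply the sum of its density values (the map below is random data, purely for shape):

```python
import torch

# Hypothetical: the fused (second) density map integrates to the predicted head count,
# so summing all density values gives the number of people in the image.
second_density_map = torch.rand(1, 1, 64, 64) * 0.01
predicted_count = second_density_map.sum().item()
print(round(predicted_count))
```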
19. The apparatus of claim 12, wherein the steps of feature extraction, crowd density acquisition, matching degree acquisition, and number of people acquisition are performed by an image processing model;
the image processing model comprises a feature extraction network, at least two branch networks, and a synthesis module; the feature extraction network is configured to perform feature extraction on the image to be processed to obtain the depth features of the image to be processed; the at least two branch networks are configured to perform the crowd density acquisition step based on the at least two different receptive fields; and the synthesis module is configured to perform the matching degree acquisition step and the number of people acquisition step.
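A hypothetical composition of the three parts named in claim 19 is sketched below; the class and argument names are invented, and the branch networks could, for instance, be the dilated-convolution heads sketched under claim 13:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrowdCountingModel(nn.Module):
    """Hypothetical assembly: a shared feature extraction network, one branch network
    per receptive field, and a synthesis module that weights the branch density maps
    by per-area matching degrees."""
    def __init__(self, backbone: nn.Module, branches: nn.ModuleList, synth: nn.Module):
        super().__init__()
        self.backbone = backbone      # feature extraction network
        self.branches = branches      # one branch network per receptive field
        self.synth = synth            # produces K matching-degree channels

    def forward(self, image):
        depth = self.backbone(image)
        density_maps = torch.cat([b(depth) for b in self.branches], dim=1)  # (N, K, H, W)
        weights = F.softmax(self.synth(depth), dim=1)                       # (N, K, H, W)
        fused = (weights * density_maps).sum(dim=1, keepdim=True)           # second density map
        return density_maps, fused
```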
20. The apparatus of claim 19, wherein the image processing model is trained based on a sample image and a target density map corresponding to the sample image, the target density map being obtained based on the following process:
acquiring a sample image and the positions of at least two human heads in the sample image;
generating at least two first response maps according to the positions of the at least two heads, wherein in each first response map the pixel value at the position of the corresponding head is 1 and the pixel values at the other positions are 0;
summing the at least two first response maps to obtain a second response map;
and performing Gaussian convolution processing on the second response map to obtain the target density map.
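A minimal sketch of claim 20's target-density-map construction, collapsing the per-head first response maps directly into their sum before Gaussian blurring (the Gaussian sigma is an assumed value):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def make_target_density_map(head_positions, height, width, sigma=4.0):
    """Hypothetical helper: build the binary response map (1 at each annotated head,
    0 elsewhere) and blur it with a Gaussian kernel."""
    response = np.zeros((height, width), dtype=np.float32)   # the summed second response map
    for y, x in head_positions:
        response[int(y), int(x)] = 1.0
    return gaussian_filter(response, sigma=sigma)

target = make_target_density_map([(10, 20), (30, 40)], 64, 64)
print(target.sum())  # approximately 2.0, i.e. the number of annotated heads
```

Because the blurred map still sums (approximately) to the number of annotated heads, it serves directly as a counting supervision target.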
21. The apparatus of claim 19, wherein the training process of the image processing model comprises:
acquiring sample images, each sample image corresponding to a target density map;
inputting the sample image into an image processing model, and performing feature extraction on the sample image by the image processing model to obtain sample depth features of the sample image; obtaining crowd densities of different image areas in the sample image based on at least two different receptive fields and the sample depth features, to obtain at least two sample first density maps corresponding to the at least two receptive fields; acquiring matching degrees between the different image areas in the sample image and the at least two receptive fields according to the sample depth features; and weighting the crowd densities of the different image areas in the at least two sample first density maps, with the matching degrees corresponding to the at least two receptive fields as weights, to obtain a sample second density map;
obtaining, based on the target density map of the sample image, the at least two sample first density maps and the sample second density map, prediction loss values corresponding to the at least two different receptive fields and a comprehensive loss value, respectively;
and updating model parameters of the image processing model based on the prediction loss values and the comprehensive loss value, stopping when a target condition is met, to obtain a trained image processing model.
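To make the training step of claim 21 concrete, here is a hedged sketch assuming both the per-branch prediction losses and the comprehensive loss are mean-squared errors against the target density map; the claims do not specify the loss form, optimizer, or stopping condition, so all of those are assumptions:

```python
import torch
import torch.nn.functional as F

# Hypothetical training step, assuming `model` returns the per-branch (first) density
# maps and the fused (second) density map, and that the target density map has the
# same spatial size as the predictions.
def train_step(model, optimizer, sample_image, target_density_map):
    density_maps, fused = model(sample_image)             # (N, K, H, W) and (N, 1, H, W)
    prediction_loss = sum(
        F.mse_loss(density_maps[:, k:k + 1], target_density_map)
        for k in range(density_maps.shape[1])
    )                                                     # one prediction loss per receptive field
    comprehensive_loss = F.mse_loss(fused, target_density_map)
    loss = prediction_loss + comprehensive_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```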
22. An electronic device, comprising one or more processors and one or more memories, wherein the one or more memories store at least one piece of program code, and the at least one piece of program code is loaded and executed by the one or more processors to implement the image processing method of any one of claims 1 to 11.
23. A computer readable storage medium having stored therein at least one program code loaded and executed by a processor to implement the image processing method of any one of claims 1 to 11.
24. A computer program product, characterized in that it comprises one or more program codes stored in a computer readable storage medium, wherein one or more processors of an electronic device are capable of reading the one or more program codes from the computer readable storage medium and executing them, so that the electronic device can perform the image processing method according to any one of claims 1 to 11.
CN202011020045.7A 2020-09-24 2020-09-24 Image processing method, device, equipment and storage medium Active CN112115900B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011020045.7A CN112115900B (en) 2020-09-24 2020-09-24 Image processing method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112115900A CN112115900A (en) 2020-12-22
CN112115900B true CN112115900B (en) 2024-04-30

Family

ID=73801617

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011020045.7A Active CN112115900B (en) 2020-09-24 2020-09-24 Image processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112115900B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112712078A (en) * 2020-12-31 2021-04-27 上海智臻智能网络科技股份有限公司 Text detection method and device
CN112766340B (en) * 2021-01-11 2024-06-04 中山大学 Depth capsule network image classification method and system based on self-adaptive spatial mode
CN113936071A (en) * 2021-10-18 2022-01-14 清华大学 Image processing method and device


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018187632A1 (en) * 2017-04-05 2018-10-11 Carnegie Mellon University Deep learning methods for estimating density and/or flow of objects, and related methods and software
CN108596054A (en) * 2018-04-10 2018-09-28 上海工程技术大学 A kind of people counting method based on multiple dimensioned full convolutional network Fusion Features
CN109697435A (en) * 2018-12-14 2019-04-30 重庆中科云从科技有限公司 Stream of people's quantity monitoring method, device, storage medium and equipment
CN109858424A (en) * 2019-01-25 2019-06-07 佳都新太科技股份有限公司 Crowd density statistical method, device, electronic equipment and storage medium
US10453197B1 (en) * 2019-02-18 2019-10-22 Inception Institute of Artificial Intelligence, Ltd. Object counting and instance segmentation using neural network architectures with image-level supervision
CN110188685A (en) * 2019-05-30 2019-08-30 燕山大学 A kind of object count method and system based on the multiple dimensioned cascade network of double attentions
CN110852267A (en) * 2019-11-11 2020-02-28 复旦大学 Crowd density estimation method and device based on optical flow fusion type deep neural network
CN110956122A (en) * 2019-11-27 2020-04-03 深圳市商汤科技有限公司 Image processing method and device, processor, electronic device and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Crowd counting with crowd attention convolutional neural network; Jiwei Chen et al.; Neurocomputing; Vol. 382, No. 21; 210-220 *
DecideNet: Counting Varying Density Crowds Through Attention Guided Detection and Density Estimation; Jiang Liu et al.; Computer Vision and Pattern Recognition (CVPR); 5197-5206 *
Research on crowd counting and density estimation based on computer vision; Yu Yang; China Masters' Theses Full-text Database, No. 7; I138-1089 *

Also Published As

Publication number Publication date
CN112115900A (en) 2020-12-22

Similar Documents

Publication Publication Date Title
CN110348543B (en) Fundus image recognition method and device, computer equipment and storage medium
CN111476306B (en) Object detection method, device, equipment and storage medium based on artificial intelligence
CN111091576B (en) Image segmentation method, device, equipment and storage medium
WO2020224479A1 (en) Method and apparatus for acquiring positions of target, and computer device and storage medium
CN112115900B (en) Image processing method, device, equipment and storage medium
CN111243668B (en) Method and device for detecting molecule binding site, electronic device and storage medium
CN110570460B (en) Target tracking method, device, computer equipment and computer readable storage medium
CN111860485B (en) Training method of image recognition model, image recognition method, device and equipment
CN111091166A (en) Image processing model training method, image processing device, and storage medium
CN112381707B (en) Image generation method, device, equipment and storage medium
CN111489378A (en) Video frame feature extraction method and device, computer equipment and storage medium
CN112749613B (en) Video data processing method, device, computer equipment and storage medium
CN112990053B (en) Image processing method, device, equipment and storage medium
CN114332530A (en) Image classification method and device, computer equipment and storage medium
CN111368116B (en) Image classification method and device, computer equipment and storage medium
CN112036331A (en) Training method, device and equipment of living body detection model and storage medium
CN114283050A (en) Image processing method, device, equipment and storage medium
CN113505256B (en) Feature extraction network training method, image processing method and device
CN114820633A (en) Semantic segmentation method, training device and training equipment of semantic segmentation model
CN113705302A (en) Training method and device for image generation model, computer equipment and storage medium
CN113516665A (en) Training method of image segmentation model, image segmentation method, device and equipment
CN114677350B (en) Connection point extraction method, device, computer equipment and storage medium
CN113763931B (en) Waveform feature extraction method, waveform feature extraction device, computer equipment and storage medium
CN112037305B (en) Method, device and storage medium for reconstructing tree-like organization in image
CN113762046A (en) Image recognition method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant