CN112115900A - Image processing method, device, equipment and storage medium - Google Patents

Image processing method, device, equipment and storage medium

Info

Publication number
CN112115900A
Authority
CN
China
Prior art keywords
image
processed
different
density
receptive fields
Prior art date
Legal status
Granted
Application number
CN202011020045.7A
Other languages
Chinese (zh)
Other versions
CN112115900B (en)
Inventor
王昌安
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202011020045.7A priority Critical patent/CN112115900B/en
Publication of CN112115900A publication Critical patent/CN112115900A/en
Application granted granted Critical
Publication of CN112115900B publication Critical patent/CN112115900B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06V 20/53: Recognition of crowd images, e.g. recognition of crowd congestion (under G06V 20/00 Scenes; G06V 20/50 Context or environment of the image; G06V 20/52 Surveillance or monitoring of activities)
    • G06F 18/24: Classification techniques (under G06F 18/00 Pattern recognition; G06F 18/20 Analysing)
    • G06N 3/045: Combinations of networks (under G06N 3/00 Computing arrangements based on biological models; G06N 3/02 Neural networks; G06N 3/04 Architecture, e.g. interconnection topology)
    • G06N 3/08: Learning methods (under G06N 3/02 Neural networks)
    • G06V 10/40: Extraction of image or video features (under G06V 10/00 Arrangements for image or video recognition or understanding)


Abstract

The application discloses an image processing method, device, equipment and storage medium, belonging to the technical field of artificial intelligence. In the embodiments of the application, it is considered that different image areas in an image are at different distances, so the sizes of human heads may differ. For each image area in the image, after image features are extracted, the crowd density is obtained under different receptive fields; by analyzing the different image areas in the image, how near or far each image area is can be determined, and a suitable receptive field is dynamically selected to adjust the crowd density of that image area, so that the obtained crowd density better matches the distance of the image area. This improves the precision of crowd density acquisition and, in turn, the accuracy of the determined number of people.

Description

Image processing method, device, equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to an image processing method, apparatus, device, and storage medium.
Background
With the development of artificial intelligence technology, in more and more fields data processing is implemented by using computers to simulate, extend and expand human intelligence, so that data can be processed automatically instead of manually and data processing efficiency is improved.
Processing images based on artificial intelligence is one application of artificial intelligence technology. In one scenario, an image can be processed to determine the crowd density in the image and thus the number of people in the image. Currently, an image processing method generally extracts features of an image, predicts a density map of the image with a prediction network based on the extracted features, and determines the number of people based on the density map.
Because different image areas in an image are at different distances, the sizes of the human heads in them differ. The above image processing method does not take this difference into account, so the predicted density map has low precision and poor accuracy.
Disclosure of Invention
The embodiments of the present application provide an image processing method, apparatus, device, and storage medium, which can improve the precision and accuracy of crowd estimation.
In one aspect, an image processing method is provided, and the method includes:
acquiring an image to be processed;
performing feature extraction on the image to be processed to obtain the depth feature of the image to be processed;
acquiring the crowd density of different image areas in the image to be processed based on at least two receptive fields and the depth features to obtain at least two first density maps corresponding to the at least two receptive fields, wherein the at least two receptive fields are different from each other;
according to the depth features, obtaining matching degrees between different image areas in the image to be processed and the at least two receptive fields;
and taking the matching degrees corresponding to the at least two receptive fields as weights, and integrating the crowd densities of different image areas in the at least two first density maps to obtain the number of people contained in the image to be processed.
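For illustration only, the following is a minimal sketch of how such a pipeline could be wired together. The module names, the choice of PyTorch, the softmax normalization of the matching degrees, and the assumption that the weight maps and density maps share a spatial size are assumptions of this example, not details given by the claims.

```python
# Hypothetical sketch of the claimed pipeline (not the patent's reference implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrowdCounter(nn.Module):
    def __init__(self, backbone, density_heads, matching_head):
        super().__init__()
        self.backbone = backbone                            # feature extraction network
        self.density_heads = nn.ModuleList(density_heads)   # one head per receptive field
        self.matching_head = matching_head                  # predicts per-region matching degrees

    def forward(self, image):
        feat = self.backbone(image)                                   # depth features
        density_maps = [head(feat) for head in self.density_heads]    # first density maps
        # matching degrees: one weight map per receptive field, normalized across fields
        # (assumes the weight maps and density maps share the same spatial size)
        weights = F.softmax(self.matching_head(feat), dim=1)          # (B, K, H, W)
        stacked = torch.cat(density_maps, dim=1)                      # (B, K, H, W)
        fused = (stacked * weights).sum(dim=1, keepdim=True)          # second density map
        count = fused.sum(dim=(1, 2, 3))                              # people count per image
        return fused, count
```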
In one possible implementation, two adjacent first convolution layers of the at least two first convolution layers include a pooling layer therebetween;
the performing feature extraction on the image to be processed to obtain the depth feature of the image to be processed further includes:
after convolution processing is carried out on at least one first convolution layer, before the extracted intermediate depth features are input into the next first convolution layer, pooling processing is carried out on the intermediate depth features based on a pooling layer, and the intermediate depth features input into the next first convolution layer are obtained.
In a possible implementation manner, the classifying the intermediate features of the different image regions to obtain probability distributions of the different image regions in the image to be processed includes:
and carrying out normalization processing on the intermediate features of the different image areas to obtain the probability distribution of the different image areas.
In one aspect, an image processing apparatus is provided, the apparatus including:
the image acquisition module is used for acquiring an image to be processed;
the feature extraction module is used for extracting features of the image to be processed to obtain depth features of the image to be processed;
a density acquisition module, configured to acquire, based on at least two receptive fields and the depth features, population densities of different image regions in the image to be processed to obtain at least two first density maps corresponding to the at least two receptive fields, where the at least two receptive fields are different from each other;
the matching degree acquisition module is used for acquiring the matching degree between different image areas in the image to be processed and the at least two receptive fields according to the depth characteristics;
and the quantity acquisition module is used for integrating the crowd densities of different image areas in the at least two first density maps by taking the matching degrees corresponding to the at least two receptive fields as weights to obtain the quantity of people contained in the image to be processed.
In a possible implementation manner, the density obtaining module is configured to perform convolution processing on the depth features based on at least two of convolutional layers with different dilation (void) rates, a deformable convolutional layer, an inception structure, or a residual structure, respectively, to obtain at least two first density maps corresponding to at least two receptive fields.
In a possible implementation manner, the feature extraction module is configured to perform convolution processing on the image to be processed based on at least one continuous convolution layer, and take an output of a last convolution layer as the depth feature.
In one possible implementation, the number of the at least one convolutional layer is at least two; the at least one convolutional layer adopts a skip link; the two convolutional layers connected by the skip link include a first convolutional layer and a second convolutional layer, the first convolutional layer being used for down-sampling the depth features output by the previous first convolutional layer, and the second convolutional layer being used for up-sampling the depth features output by the previous second convolutional layer and the depth features output by the connected first convolutional layer.
In one possible implementation manner, the matching degree obtaining module includes a feature obtaining unit and a normalizing unit;
the feature acquisition unit is used for acquiring intermediate features of different image areas in the image to be processed according to the depth features;
the normalization unit is used for performing normalization processing on the intermediate features of the different image areas to obtain the matching degree between the different image areas and the at least two receptive fields in the image to be processed.
In a possible implementation manner, the normalization unit is configured to perform normalization processing on the intermediate features of the different image regions to obtain probability distributions of the different image regions in the image to be processed, a probability distribution of one image region is used to represent a matching degree between the image region and the at least two receptive fields, and a probability value in the probability distribution is used to represent a matching degree between the image region and a target receptive field.
In one possible implementation, the feature obtaining unit is configured to:
carrying out average pooling on the depth features to obtain the depth features of the different image areas;
and performing convolution processing on the depth features of the different image areas respectively to obtain the intermediate features of the different image areas.
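As a hedged illustration of the matching-degree branch described above (average pooling into regions, a convolution producing intermediate features, then normalization into a probability distribution over the receptive fields), a possible sketch follows; the pooling granularity, kernel size, and use of softmax are assumptions of this example.

```python
# Hypothetical matching-degree module: average-pool the depth features into regions,
# convolve to obtain intermediate features, then normalize into a probability
# distribution over the K receptive fields for each region.
import torch.nn as nn
import torch.nn.functional as F

class MatchingDegree(nn.Module):
    def __init__(self, in_channels, num_fields, region_size=4):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=region_size)              # per-region depth features
        self.conv = nn.Conv2d(in_channels, num_fields, kernel_size=1)  # intermediate features

    def forward(self, feat):
        regions = self.pool(feat)
        logits = self.conv(regions)
        # softmax over the receptive-field dimension gives each region's probability
        # distribution, i.e. its matching degree with each of the K receptive fields
        return F.softmax(logits, dim=1)
```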
In one possible implementation manner, the number obtaining module includes a weighting unit and a number obtaining unit;
the weighting unit is used for weighting the crowd densities of different image areas in the at least two first density maps by taking the matching degrees corresponding to the at least two receptive fields as weights to obtain a second density map of the image to be processed;
the number obtaining unit is used for obtaining the number of the people contained in the image to be processed according to the second density map.
In a possible implementation manner, the number obtaining unit is configured to perform summation processing on the density values in the second density map to obtain the number of people included in the image to be processed.
In one possible implementation, the steps of feature extraction, crowd density acquisition, matching degree acquisition and number of people acquisition are performed by an image processing model;
the image processing model comprises a feature extraction network, at least two branch networks and a comprehensive module; the feature extraction network is used for extracting features of the image to be processed to obtain depth features of the image to be processed; the at least two branch networks are used for executing a crowd density obtaining step based on at least two different receptive fields; the integration module is used for executing the matching degree obtaining step and the number obtaining step.
In one possible implementation, the image processing model is obtained based on a sample image and a target density map corresponding to the sample image; the target density map is obtained based on the following process:
obtaining a sample image and positions of at least two human heads in the sample image;
generating at least two first response graphs according to the positions of the at least two heads, wherein the pixel values of the positions of the at least two heads in the first response graphs are 1, and the pixel values of other positions are 0;
summing the at least two first response graphs to obtain a second response graph;
and performing Gaussian convolution processing on the second response image to obtain the target density image.
In one possible implementation, the training process of the image processing model includes:
obtaining sample images, wherein each sample image corresponds to a target density map;
inputting a sample image into an image processing model, and performing feature extraction on the sample image by the image processing model to obtain sample depth features of the sample image; acquiring the crowd density of different image areas in the sample image based on at least two different receptive fields and the sample depth features to obtain at least two sample first density maps corresponding to the at least two receptive fields; according to the sample depth features, obtaining matching degrees between different image areas in the sample image and the at least two receptive fields; and weighting the crowd densities of different image areas in the at least two sample first density maps by taking the matching degrees corresponding to the at least two receptive fields as weights to obtain a sample second density map;
respectively acquiring a prediction loss value and a comprehensive loss value corresponding to the at least two different receptive fields based on the target density map, the first sample density map and the second sample density map of the sample image;
and updating the model parameters of the image processing model based on the prediction loss value and the comprehensive loss value until the model parameters meet the target condition, so as to obtain the trained image processing model.
In one aspect, an electronic device is provided that includes one or more processors and one or more memories having at least one program code stored therein, the at least one program code being loaded into and executed by the one or more processors to implement various alternative implementations of the above-described image processing method.
In one aspect, a computer-readable storage medium is provided, in which at least one program code is stored, which is loaded and executed by a processor to implement various alternative implementations of the image processing method described above.
In one aspect, a computer program product or computer program is provided that includes one or more program codes stored in a computer-readable storage medium. One or more processors of the electronic device can read the one or more program codes from the computer-readable storage medium, and the one or more processors execute the one or more program codes, so that the electronic device can execute the image processing method of any one of the above possible embodiments.
In the embodiments of the application, it is considered that different image areas in an image are at different distances, so the sizes of human heads may differ. For each image area in the image, after image features are extracted, the crowd density is obtained under different receptive fields; by analyzing the different image areas in the image, how near or far each image area is can be determined, and a suitable receptive field is dynamically selected to adjust the crowd density of that image area, so that the obtained crowd density better matches the distance of the image area. This improves the precision of crowd density acquisition and, in turn, the accuracy of the determined number of people.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is evident that the drawings in the following description show only some embodiments of the present application, and that those skilled in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 is a schematic view of a receptive field provided by an embodiment of the present application;
fig. 2 is a schematic diagram of an implementation environment of an image processing method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an image processing process in a monitored scene according to an embodiment of the present disclosure;
fig. 4 is a flowchart of an image processing method provided in an embodiment of the present application;
fig. 5 is a flowchart of an image processing method provided in an embodiment of the present application;
FIG. 6 is a schematic diagram of a feature extraction network provided in an embodiment of the present application;
FIG. 7 is a schematic diagram of a hole convolution according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a convolutional layer with a dilation (void) rate according to an embodiment of the present disclosure;
fig. 9 is a schematic diagram of an inception structure provided in an embodiment of the present application;
fig. 10 is a schematic diagram of a sampling manner of a normal convolution and a deformable convolution with a convolution kernel size of 3 × 3 according to an embodiment of the present application;
FIG. 11 is a schematic diagram of a residual structure provided by an embodiment of the present application;
FIG. 12 is a schematic diagram of a process for density estimation based on depth features according to an embodiment of the present disclosure;
FIG. 13 is a schematic diagram of a process for density estimation based on depth features according to an embodiment of the present disclosure;
FIG. 14 is a schematic diagram of an image before and after processing according to an embodiment of the present application;
fig. 15 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application;
fig. 16 is a block diagram of a terminal according to an embodiment of the present disclosure;
fig. 17 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The terms "first," "second," and the like in this application are used for distinguishing between similar items and items that have substantially the same function or similar functionality, and it should be understood that "first," "second," and "nth" do not have any logical or temporal dependency or limitation on the number or order of execution. It will be further understood that, although the following description uses the terms first, second, etc. to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first image can be referred to as a second image, and similarly, a second image can be referred to as a first image without departing from the scope of various described examples. The first image and the second image can both be images, and in some cases, can be separate and distinct images.
The term "at least one" is used herein to mean one or more, and the term "plurality" is used herein to mean two or more, e.g., a plurality of packets means two or more packets.
It is to be understood that the terminology used in the description of the various described examples herein is for the purpose of describing particular examples only and is not intended to be limiting. As used in the description of the various described examples and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. The term "and/or" is an associative relationship that describes an associated object, meaning that three relationships can exist, e.g., a and/or B, can mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" in the present application generally indicates that the former and latter related objects are in an "or" relationship.
It should also be understood that, in the embodiments of the present application, the size of the serial number of each process does not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
It should also be understood that determining B from a does not mean determining B from a alone, but can also determine B from a and/or other information.
It will be further understood that the terms "comprises," "comprising," "includes," and/or "including," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also understood that the term "if" may be interpreted to mean "when" ("where" or "upon") or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined." or "if [ a stated condition or event ] is detected" may be interpreted to mean "upon determining.. or" in response to determining. "or" upon detecting [ a stated condition or event ] or "in response to detecting [ a stated condition or event ]" depending on the context.
The following is a description of terms involved in the present application.
Receptive field: in a convolutional neural network, the receptive field (Receptive Field) is defined as the size of the region on the input image that a pixel on the feature map (feature map) output by each layer of the convolutional neural network is mapped from. In plain terms, a receptive field is the region on the input image that a point on a feature map corresponds to, as shown in FIG. 1. After convolution processing, a point on a later feature map corresponds to a region on an earlier feature map. If the receptive field is large, the region is large, the features in the feature map have a higher semantic level, and the feature map is more global; if the receptive field is small, the region is small, the features in the feature map have a lower semantic level, and the feature map is more local.
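The application itself does not state a formula for this; purely for reference, the standard textbook recurrence for the receptive field r_l of layer l in a stack of convolutions with kernel sizes k_l and strides s_i is:

$$ r_l = r_{l-1} + (k_l - 1)\prod_{i=1}^{l-1} s_i, \qquad r_0 = 1. $$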
Up-sampling and down-sampling: down-sampling extracts the features of a picture, in effect the key parts of the picture, reducing its resolution and shrinking it; up-sampling uses various methods to restore the size of the picture and increase its resolution, and any technique that makes a picture higher in resolution can be called up-sampling.
Bilinear interpolation: also known as bilinear interpolation. Bilinear interpolation includes nearest neighbor interpolation, bilinear interpolation, bicubic interpolation, and the like. For linear interpolation, linear interpolation is an interpolation method for one-dimensional data, which performs numerical estimation based on two data points adjacent to the point to be interpolated in the one-dimensional data sequence on the left and right, and assigns their specific gravity based on the distance to the two points. For bilinear interpolation, it can be understood as two-step linear interpolation: firstly, interpolation is carried out in the x direction, and secondly, interpolation is carried out in the y direction by using the interpolation result in the x direction. Bilinear interpolation is one way of image scaling.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
Computer Vision (CV) technology: computer vision is the science of studying how to make machines "see"; more specifically, it uses cameras and computers instead of human eyes to identify, track, and measure targets and perform other machine vision tasks, and further performs image processing so that the processed images are more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, and also include common biometric technologies such as face recognition and fingerprint recognition.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how computers simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching learning.
With the research and progress of artificial intelligence technology, the artificial intelligence technology is developed and applied in a plurality of fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and the like.
The scheme provided by the embodiment of the application relates to the technologies of image processing, image semantic understanding and the like in artificial intelligence computer vision and the technologies of neural network learning and the like in machine learning, and is specifically explained by the following embodiment.
The following describes an embodiment of the present application.
Fig. 2 is a schematic diagram of an implementation environment of an image processing method according to an embodiment of the present application. The implementation environment includes a terminal 101, or the implementation environment includes a terminal 101 and an image processing platform 102. The terminal 101 is connected to the image processing platform 102 through a wireless network or a wired network.
The terminal 101 can be at least one of a smartphone, a game console, a desktop computer, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, or a laptop computer. The terminal 101 has installed on it and runs an application program that supports image processing, which can be, for example, a system application, a shopping application, an online video application, or a social application.
Illustratively, the terminal 101 can have an image capturing function and an image processing function, and can process a captured image and execute the corresponding function according to the processing result. The terminal 101 can complete this work independently, or the image processing platform 102 can provide data services for it. The embodiments of the present application do not limit this.
The image processing platform 102 includes at least one of a server, a plurality of servers, a cloud computing platform, and a virtualization center. The image processing platform 102 is used to provide background services for image processing applications. Optionally, the image processing platform 102 undertakes primary processing, and the terminal 101 undertakes secondary processing; or, the image processing platform 102 undertakes the secondary processing work, and the terminal 101 undertakes the primary processing work; alternatively, the image processing platform 102 or the terminal 101 can be separately provided with processing work. Alternatively, the image processing platform 102 and the terminal 101 perform cooperative computing by using a distributed computing architecture.
Optionally, the image processing platform 102 includes at least one server 1021 and a database 1022, where the database 1022 is used for storing data, and in this embodiment, the database 1022 can store sample images to provide data services for the at least one server 1021.
The server can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, big data, and artificial intelligence platforms. The terminal can be, but is not limited to, a smartphone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like.
Those skilled in the art will appreciate that the number of the terminals 101 and the servers 1021 can be greater or smaller. For example, the number of the terminals 101 and the servers 1021 may be only one, or the number of the terminals 101 and the servers 1021 may be several tens or several hundreds, or more, and the number of the terminals or the servers and the device types are not limited in the embodiment of the present application.
The image processing method can be applied to any people counting scene, for example, in a monitoring scene, the image shot by monitoring can be processed, and the number of people contained in the image can be determined. For example, as shown in fig. 3, a camera 301 may be provided in some large-scale places, and the camera 301 monitors the place in real time and sends a captured image or video 302 to an image processing platform 303. Further, the image processing platform 303 can process the shot image or video 302, determine the number of people in any frame or each frame of the image or video, and monitor the number of people in the place.
Fig. 4 is a flowchart of an image processing method provided in an embodiment of the present application, where the method is applied to an electronic device, where the electronic device is a terminal or a server, and referring to fig. 4, the method includes the following steps.
401. The electronic device obtains an image to be processed.
402. And the electronic equipment performs feature extraction on the image to be processed to obtain the depth feature of the image to be processed.
A feature is a characteristic that distinguishes one object from another. The features of the image to be processed are extracted and referred to as depth features; based on them, the image content can be analyzed and the crowd density determined.
403. The electronic equipment obtains the crowd density of different image areas in the image to be processed based on at least two receptive fields and the depth characteristics to obtain at least two first density maps corresponding to the at least two receptive fields, wherein the at least two receptive fields are different from each other.
After the depth features are obtained, the electronic device may perform population density estimation on the depth features in different receptive fields, and for the same image region, the population densities obtained through different receptive fields may have differences. Thus, the at least two first density maps comprise the population densities acquired with different precisions for the image area.
404. And the electronic equipment acquires the matching degree between different image areas in the image to be processed and the at least two receptive fields according to the depth characteristics.
The matching degree is used to indicate how well a receptive field fits an image area: for some image areas the crowd density determined with a small receptive field is more accurate, while for other image areas the crowd density determined with a large receptive field is more accurate. The matching degree measures how accurately each of the receptive fields determines the crowd density of a given image area.
405. And the electronic equipment uses the matching degrees corresponding to the at least two receptive fields as weights, and integrates the crowd densities of different image areas in the at least two first density maps to obtain the number of people contained in the image to be processed.
Because different image areas are at different distances, the receptive fields they are suited to differ. After the first density maps obtained with different receptive fields are available, the crowd densities in the at least two first density maps are integrated using the matching degrees as weights. This amounts to dynamically selecting, according to the image content, the crowd density obtained with the most suitable receptive field for each image area, so that the determined number of people is more accurate.
In the embodiments of the application, it is considered that different image areas in an image are at different distances, so the sizes of human heads may differ. For each image area in the image, after image features are extracted, the crowd density is obtained under different receptive fields; by analyzing the different image areas in the image, how near or far each image area is can be determined, and a suitable receptive field is dynamically selected to adjust the crowd density of that image area, so that the obtained crowd density better matches the distance of the image area. This improves the precision of crowd density acquisition and, in turn, the accuracy of the determined number of people.
Fig. 5 is a flowchart of an image processing method provided in an embodiment of the present application, and referring to fig. 5, the method includes the following steps.
501. The electronic device obtains an image to be processed.
In the embodiment of the application, the electronic device has an image processing function, and can process the image to be processed to determine the number of people contained in the image.
In some embodiments, the image to be processed may include one or more persons, and the electronic device may process the image to be processed to determine the number of persons included in the image to be processed. In other embodiments, the image to be processed may not include a person, and the electronic device may determine that the number of persons is zero after processing the image.
The electronic device may acquire the image to be processed in a variety of ways. The electronic device may be a terminal, or a server.
In some embodiments, the electronic device is a terminal. In a possible implementation manner of the embodiment, the terminal has an image capturing function, and the terminal can capture an image as the image to be processed. In another possible implementation of the embodiment, the terminal can download an image from a target website as the image to be processed. In another possible implementation manner of the embodiment, the terminal may extract an image from an image database as the image to be processed. In another possible implementation manner of the embodiment, the terminal may acquire an imported image as the image to be processed in response to an image import operation.
In other embodiments, the electronic device may be a server. In one possible implementation of this embodiment, the server can receive images captured and transmitted by the terminal. In another possible implementation, the server can download an image from a target website as the image to be processed. In another possible implementation, the server can extract an image from an image database as the image to be processed.
The above description provides only a few possible obtaining manners of the image to be processed, and of course, the terminal or the server may also obtain the image to be processed in other manners.
502. The electronic device inputs the image to be processed into the image processing model.
After the electronic device acquires the image to be processed, the image to be processed needs to be processed. In this embodiment, the image processing method can be implemented by an image processing model, and the electronic device may call the image processing model, input the image to be processed into the image processing model, and execute the subsequent image processing steps by the image processing model. In other embodiments, the electronic device may also directly perform a subsequent image processing step, which is not implemented based on the image processing model, and this is not limited in this embodiment of the present application.
For the image processing model, in some embodiments, the image processing model may be trained in the electronic device. In other embodiments, the image processing model may be sent to the electronic device by other electronic devices after training on the other electronic devices is completed, and the electronic device may call the image processing model during image processing. The embodiment of the present application does not limit on which device the training process of the image processing model is performed.
The following explains the image processing model.
In one possible implementation, the image processing model includes a feature extraction network, at least two branch networks, and an integration module. In the image processing process, the feature extraction network is configured to perform feature extraction on the image to be processed to obtain the depth features of the image to be processed, that is, the following step 503; the at least two branch networks are used for performing the crowd density obtaining step based on at least two different receptive fields, that is, the following step 504; and the integration module is used for performing the matching degree obtaining step and the people-number obtaining step, that is, the following steps 505 to 507.
The image processing model is obtained by training based on the sample image and the target density map corresponding to the sample image. The target density map is a real density map of the sample image, and the pixel value of each pixel point in the target density map is the crowd density at the pixel point. And training the image processing model by taking the target density map as a true value, so that the trained image processing model can process the sample image to obtain the target density map, or the output result is very close to the target density map, thereby improving the image processing capability of the image processing model and accurately processing the image.
For the target density map, the target density map can be determined according to the position of the head in the sample image. In some embodiments, the target density map is obtained based on steps one through four described below.
The method comprises the steps that firstly, electronic equipment obtains a sample image and positions of at least two human heads in the sample image.
In the first step, the positions of all the human heads in the sample image can be determined, and the positions of all the human heads are used as the basis for determining the target density map.
In some embodiments, the center point of the head may be used as the location of the head. In this embodiment, if the center point of the head is in the sample image, the head may be counted as one of the persons included in the sample image. If the center point of the head is not in the sample image, the head is not counted as one of the persons included in the sample image. For example, if only a small part of the head is included in the sample image and the center position of the head is not in the sample image, the small part of the head is not counted when the number of persons included in the sample image is calculated.
The positions of the at least two heads may be obtained by labeling by a related technician, or may be obtained by target detection, or by combining a manual labeling method on the basis of the target detection, which is not limited in the embodiment of the present application.
And step two, the electronic equipment generates at least two first response graphs according to the positions of the at least two heads, wherein the pixel values of the positions of the at least two heads in the first response graphs are 1, and the pixel values of other positions are 0.
In the second step, the first response map reflects whether each pixel contains the center point of a head: if so, the pixel value of that pixel is 1; if not, the pixel value is 0. It should be noted that although "1" is used here to represent inclusion and "0" to represent non-inclusion, the opposite convention, with "0" representing inclusion and "1" representing non-inclusion, could also be used; the embodiments of the present application do not limit this. Whether each pixel contains the center point of a head reflects, to some extent, the density of the heads.
And step three, the electronic equipment sums the at least two first response graphs to obtain a second response graph.
In the third step, a corresponding first response map has been generated for the position of each human head, and the electronic device can sum the multiple first response maps, combining them into the second response map of the sample image.
And fourthly, the electronic equipment performs Gaussian convolution processing on the second response image to obtain the target density image.
The crowd density is concentrated at the center point of the head in the second response image, the electronic equipment can perform Gaussian convolution processing on the second response image, the crowd density can be dispersed into the pixel points around the center point of the head, and then the real crowd density of each pixel point of the image is determined. For each human head, assuming that the contribution value of the human head to the density of the surrounding pixel points is attenuated according to a gaussian function, the gaussian convolution processing can be adopted when the second response image is processed, that is, the second response image is processed through the gaussian convolution kernel.
For example, in one particular example, the learning target of the model is a crowd density distribution heat map, referred to herein simply as a density map or heat map. The heat map reflects the average number of people at the position corresponding to each unit pixel in the actual scene (i.e. the sample image). To generate the crowd density map (i.e. the target density map) of the whole image, the N head center points x1, ..., xN in the image are considered, where N and i are positive integers. For each head center xi, a two-dimensional response map Hi (i.e. a first response map) is generated, in which only the pixel at the head center has value 1 and all other positions are 0. The Hi corresponding to all head centers are then added to obtain the response map H (i.e. the second response map) of all heads in the original image; obviously, the integral of H is the total number of people. Then, for each head, its contribution to the density of the surrounding pixels is assumed to decay as a Gaussian function, so the response map H can be convolved with a normalized Gaussian kernel Gσ to obtain the density map D (i.e. the target density map).
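A minimal sketch of this target-density-map construction is shown below; the use of NumPy/SciPy and the fixed Gaussian bandwidth sigma are assumptions of this example (the application does not fix the kernel bandwidth).

```python
# Hypothetical construction of the target density map D from annotated head centers.
import numpy as np
from scipy.ndimage import gaussian_filter

def make_target_density_map(head_centers, height, width, sigma=4.0):
    """head_centers: list of (row, col) head-center coordinates inside the image."""
    response = np.zeros((height, width), dtype=np.float32)   # response map H
    for r, c in head_centers:
        response[int(r), int(c)] += 1.0                      # pixel value 1 at each head center
    # spread each head's unit mass to the surrounding pixels with a normalized
    # Gaussian kernel; away from the image borders the integral (total count)
    # stays approximately equal to the number of annotated heads
    return gaussian_filter(response, sigma=sigma)

# e.g. density = make_target_density_map([(10, 12), (30, 40)], 64, 64); density.sum() is ~2.0
```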
The above process is how the target density map is obtained. When the image processing model is used, a density map can be obtained through prediction by the image processing model, and the total number of people can then be obtained by integrating it. Training the image processing model means training its model parameters so that the density map predicted by the image processing model differs as little as possible from the target density map.
In some embodiments, the training process of the image processing model may be implemented through steps one through four.
The method comprises the following steps that firstly, the electronic equipment obtains sample images, and each sample image corresponds to a target density map. The target density map is a true, correct density map.
Secondly, the electronic equipment inputs the sample image into an image processing model, and the image processing model performs feature extraction on the sample image to obtain the sample depth features of the sample image; acquires the crowd density of different image areas in the sample image based on at least two different receptive fields and the sample depth features to obtain at least two sample first density maps corresponding to the at least two receptive fields; obtains, according to the sample depth features, the matching degrees between different image areas in the sample image and the at least two receptive fields; and weights the crowd densities of different image areas in the at least two sample first density maps by taking the matching degrees corresponding to the at least two receptive fields as weights to obtain a sample second density map.
In the second step, the electronic device can input the sample image into the image processing model, and the image processing model executes a series of steps to determine a sample second density map of each sample image, where a process of determining the sample second density map is the same as the following steps 503 to 506, and redundant description is omitted here, and for details, refer to the subsequent steps.
In the process of determining the second density map of the sample, the sample image is analyzed according to the depth features of the sample, population density estimation is performed according to different receptive fields, the first density map is obtained, and the receptive field which is more suitable for the far and near conditions of each image area in the sample image is also analyzed, so that the first density map is integrated, and the more accurate second density map is determined.
And thirdly, the electronic equipment respectively obtains the prediction loss value and the comprehensive loss value corresponding to the at least two different receptive fields based on the target density map, the first sample density map and the second sample density map of the sample image.
In the third step, the prediction loss value is determined based on the target density map and the first sample density map, and the prediction loss value is obtained for the prediction result of each receptive field, so that in the training process, training can be performed, and the prediction capability under different receptive fields is improved. The comprehensive loss value is determined based on the target density map and the second density map of the sample, and the comprehensive loss value can be trained to improve the accuracy of the comprehensive prediction result after the prediction results of various different receptive fields are synthesized.
In one possible implementation, the electronics may obtain the predicted loss value and the integrated loss value based on a MSE (Mean Square Error) loss function.
Specifically, the electronic device may obtain the predicted loss value or the integrated loss value through the following formula one.
$$ L_{reg} = \frac{1}{N}\sum_{i=1}^{N}\left(z_i - z_i^{gt}\right)^2 \qquad \text{(formula one)} $$

wherein $L_{reg}$ represents the loss value, $N$ represents the total number of pixels in the training image, $z_i^{gt}$ is the ground-truth density value of the i-th pixel (i.e. the density value in the target density map), and $z_i$ is the value predicted by the network (i.e. the density value in the sample first density map or the sample second density map). Optimizing the network ultimately makes the density distribution map predicted by the network as close as possible to the real density map.
And fourthly, updating the model parameters of the image processing model by the electronic equipment based on the prediction loss value and the comprehensive loss value until the model parameters meet the target conditions, and obtaining the trained image processing model.
The first step to the third step are iterative processes, and after the model parameters are updated, the iterative processes can be continuously executed until the training target is reached. The target condition may be convergence of the predicted loss value and the integrated loss value, or the number of model iterations reaches the target number, which is not limited.
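Assuming both the prediction loss and the comprehensive loss are the pixel-wise MSE of formula one, a single training iteration might look like the sketch below; the model interface (returning the per-branch density maps and the fused map), the optimizer, and the equal weighting of the two loss terms are assumptions of this example.

```python
# Hypothetical single training step combining the per-branch prediction losses
# and the comprehensive loss on the fused (sample second) density map.
import torch
import torch.nn.functional as F

def train_step(model, optimizer, sample_image, target_density):
    optimizer.zero_grad()
    branch_maps, fused_map = model(sample_image)   # sample first density maps, sample second density map
    # prediction loss: one MSE term per receptive-field branch
    pred_loss = sum(F.mse_loss(m, target_density) for m in branch_maps)
    # comprehensive loss: MSE on the fused second density map
    fused_loss = F.mse_loss(fused_map, target_density)
    loss = pred_loss + fused_loss                  # equal weighting is an assumption
    loss.backward()
    optimizer.step()
    return loss.item()
```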
503. The electronic equipment performs feature extraction on the image to be processed based on a feature extraction network of the image processing model to obtain the depth features of the image to be processed.
After the image to be processed is input into the image processing model by the electronic equipment, the depth feature of the image to be processed can be extracted by the image processing model, the information of each image area in the image to be processed is represented by the depth feature, and then regression processing can be carried out aiming at the depth feature to determine the crowd density.
In some embodiments, the feature extraction network is a main network of the image processing model, and after the main network extracts the depth features, the depth features can be input into each branch network, and the branch networks process the depth features respectively.
In one possible implementation, the feature extraction network includes at least one continuous convolutional layer. The electronic device may perform convolution processing on the image to be processed based on the at least one continuous convolutional layer, taking the output of the last convolutional layer as the depth features. Here, "continuous" means that the convolutional layers are adjacent to one another.
In some embodiments, the output of a previous convolutional layer serves as the input of a subsequent convolutional layer. In other embodiments, the output of the previous convolutional layer may be further processed as the input of the next convolutional layer. The depth features can accurately represent information contained in the image to be processed after convolution processing is performed on the basis of at least one continuous convolution layer.
Wherein the number of the at least one convolution layer is one or more (i.e. at least two). In some embodiments, the number of the at least one convolutional layer is at least two, i.e., a plurality.
In some embodiments, the at least one convolutional layer may employ a skip link, wherein two convolutional layers of the skip link include a first convolutional layer and a second convolutional layer, the first convolutional layer being used for downsampling a depth feature output by a previous first convolutional layer; the second convolutional layer is used for up-sampling the depth characteristics output by the previous second convolutional layer and the depth characteristics output by the connected first convolutional layer.
Through the skip link, the depth features output by earlier convolution layers can be combined with the depth features output by later convolution layers and used as the input of a given convolution layer, so that this input contains both the context features with high-level semantic information obtained step by step through multiple convolution layers and the local detail information. In other words, the skip link introduces detail information during up-sampling, making the extracted depth features more complete and more accurate.
In one possible implementation, a pooling layer is included between two adjacent first convolutional layers of the at least two first convolutional layers, and spatial down-sampling is implemented by the pooling layer. Therefore, in the above feature extraction process, after convolution processing is performed by a first convolutional layer and before the extracted intermediate depth feature is input into the next first convolutional layer, the electronic device further performs pooling on the intermediate depth feature based on the pooling layer to obtain the intermediate depth feature that is input into the next first convolutional layer.
For example, in one particular example, the feature extraction network may employ a VGG16 network with a U-shaped structure that first down-samples and then up-samples. As shown in fig. 6, the left part of fig. 6 is the VGG16 front-end network used for down-sampling: the first convolutional layers correspond to the "ConvBlock" modules, and the second convolutional layers correspond to the plain "convolutional layer" modules. Each of ConvBlock1 through ConvBlock4 consists of multiple successive convolutional layers; the numbers of internal convolutional layers are 2, 2, 3, and 3, respectively, and all convolutions within the same ConvBlock have the same number of channels, namely 64, 128, 256, and 512 for ConvBlock1 through ConvBlock4. The right part of fig. 6 is the up-sampling part. An up-sampling convolutional layer can integrate the output of the previous convolutional layer with the output of the skip-linked ConvBlock, and the integrated result serves as the input of that convolutional layer; the integration may be an element-wise addition. The up-sampling may be bilinear, that is, linear interpolation is performed in the X and Y directions respectively to fill in the depth features obtained during down-sampling, so as to obtain the final depth features. Spatial down-sampling between the ConvBlocks is realized by max pooling (MaxPool), which enlarges the network receptive field while providing local translation invariance.
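As a concrete illustration of this U-shaped layout, the following is a minimal PyTorch sketch of such a feature extraction network. The block depths and channel widths follow the VGG16-style description above, while the 1x1 projection layers used to match channels before the element-wise addition and the choice of returning a half-resolution feature map are assumptions made for this example.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureExtractor(nn.Module):
    """U-shaped feature extraction sketch: VGG16-style ConvBlocks for down-sampling,
    bilinear up-sampling with skip links fused by element-wise addition."""
    def __init__(self):
        super().__init__()
        cfg = [(3, 64, 2), (64, 128, 2), (128, 256, 3), (256, 512, 3)]
        self.blocks = nn.ModuleList()
        for c_in, c_out, depth in cfg:
            layers = []
            for i in range(depth):
                layers += [nn.Conv2d(c_in if i == 0 else c_out, c_out, 3, padding=1),
                           nn.ReLU(inplace=True)]
            self.blocks.append(nn.Sequential(*layers))
        self.pool = nn.MaxPool2d(2)            # spatial down-sampling between ConvBlocks
        self.proj3 = nn.Conv2d(512, 256, 1)    # channel match before element-wise addition (assumed)
        self.dec3 = nn.Conv2d(256, 256, 3, padding=1)
        self.proj2 = nn.Conv2d(256, 128, 1)
        self.dec2 = nn.Conv2d(128, 128, 3, padding=1)

    def forward(self, x):
        skips = []
        for i, block in enumerate(self.blocks):
            x = block(x)
            skips.append(x)                    # kept for the skip links
            if i < len(self.blocks) - 1:
                x = self.pool(x)
        # up-sampling path: bilinear interpolation, then fuse with the skip-linked ConvBlock output
        x = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)
        x = F.relu(self.dec3(self.proj3(x) + skips[2]))
        x = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)
        x = F.relu(self.dec2(self.proj2(x) + skips[1]))
        return x                               # depth feature map at 1/2 input resolution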
504. The electronic device inputs the depth feature of the image to be processed into at least two branch networks; based on at least two receptive fields and the depth feature, the at least two branch networks acquire the crowd density of different image areas in the image to be processed, obtaining at least two first density maps corresponding to the at least two receptive fields, where the at least two receptive fields are different from each other.
Each branch network is used to process the depth feature based on one receptive field or one range of receptive fields, acquire the crowd density of different image areas, and obtain a first density map for that receptive field or receptive field range. Through the plurality of branch networks, first density maps corresponding to different receptive fields are obtained. Compared with the related-art approach of estimating crowd density with a single fixed receptive field, this takes into account that different image areas of an image may be at different distances, so the head sizes in them may differ and crowd density estimation needs different receptive fields, thereby improving the accuracy of crowd density estimation.
For the at least two branch networks, the at least two different receptive fields can be realized by different network structures. For example, the at least two branch networks may include convolutional layers with different void rates. As other examples, one of the at least two branch networks may be a deformable convolutional layer, an Inception structure, or a residual structure. In the embodiments of the present application, these branch network structures can be combined arbitrarily, as long as the at least two branch networks differ sufficiently from one another so that each branch can adaptively learn an expressive capacity suited to a different scale. The number of branch networks and the structure of each branch network are not limited in the embodiments of the present application.
In step 504, the electronic device may perform convolution processing on the depth feature based on at least two of the following: convolutional layers with different void rates, a deformable convolutional layer, an Inception structure, or a residual structure, to obtain at least two first density maps corresponding to at least two receptive fields.
In one specific example, each branch network may be referred to as a prediction head, and each prediction head is used to predict the first density map of one receptive field or one range of receptive fields. Each prediction head may adopt any one of the above structures, so that different convolution processing yields the first density maps corresponding to different receptive fields.
The convolutional layer with a void rate, the deformable convolutional layer, the Inception structure, and the residual structure are explained below.
For a convolutional layer with a void rate, such a layer performs void (dilated) convolution on the input data. Void convolution injects holes into a standard convolution kernel so as to enlarge the receptive field. Compared with an ordinary convolution operation, void convolution has one more hyper-parameter, namely the void rate (dilation rate), which refers to the spacing between the sampled points of the kernel. For example, as shown in fig. 7, taking a 3 × 3 convolution kernel as an example, zeros are filled between the elements of the 3 × 3 kernel so that sampling is performed at intervals; different spacings correspond to different void rates.
As shown in fig. 8, one or more of the at least two prediction heads may adopt the structure of convolutional layers with a void rate, that is, a branch network consisting of three 3 × 3 convolutional layers with a void rate of d, where d is a positive number.
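For illustration, a minimal sketch of such a dilated prediction head follows, in the same assumed PyTorch setting as the feature extractor sketch above; the 64-channel width and the final 1x1 regression layer are assumptions for the example.

import torch.nn as nn

class DilatedHead(nn.Module):
    """Prediction-head sketch: three 3x3 convolutions with void (dilation) rate d,
    followed by a 1x1 layer regressing a one-channel first density map."""
    def __init__(self, in_channels=128, d=2):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, padding=d, dilation=d), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=d, dilation=d), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=d, dilation=d), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 1),   # regress the first density map for this receptive field
        )

    def forward(self, feat):
        return self.head(feat)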
For an Inception structure, the structure includes multiple branches. Each branch (which can be regarded as a set of filters) adopts a different convolutional layer and performs convolution processing at a different scale on the output of the Previous Layer, and the results of the multiple branches (i.e., filters) are finally integrated to obtain a first density map corresponding to a receptive field within a certain range.
For example, fig. 9 shows an Inception structure. A branch network with the Inception structure shown in fig. 9 serves two roles: one is to use 1 × 1 convolutions to adjust the channel dimension; the other is to perform convolution at multiple kernel sizes simultaneously and then aggregate the results. In this way, the Inception structure can process the depth features to obtain a crowd density estimation result corresponding to a receptive field within a certain range.
The Inception structure is explained as follows:
The 1 × 1 convolutional layers in fig. 9 stack additional convolutions within the same receptive field, so richer features can be extracted; the three 1 × 1 convolutions in fig. 9 all serve this purpose. A 1 × 1 convolutional layer can also reduce the channel dimension and thus the computational complexity: the 1 × 1 convolutions before the middle 3 × 3 convolution and before the 5 × 5 convolution in fig. 9 both play this dimension-reduction role. When the input of a convolutional layer has a large number of feature channels, convolving it directly produces a huge amount of computation; if the input is first reduced in dimension, the amount of convolution computation after the reduction drops significantly.
In fig. 9, the input depth features are split into four branches, convolved or pooled by filters of different sizes, and finally concatenated along the feature (channel) dimension. Intuitively, convolution is thus performed on multiple scales simultaneously, so features of different scales can be extracted; richer features also mean that the final regression is more accurate. In addition, the principle of decomposing a sparse matrix into dense matrices for computation speeds up convergence. One branch of the Inception module uses max pooling, which likewise serves to extract features.
Fig. 9 shows only one kind of Inception structure; the Inception structure may also take other forms, and a person skilled in the relevant art may adjust the convolutional layers in the Inception structure as required. The embodiments of the present application do not limit this.
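For illustration, a minimal sketch of an Inception-style prediction head follows, in the same assumed PyTorch setting; the per-branch channel widths and the final 1x1 regression layer are assumptions rather than values fixed by the figure.

import torch
import torch.nn as nn

class InceptionBranch(nn.Module):
    """Inception-style prediction-head sketch: parallel 1x1, 3x3, 5x5 convolutions
    and a max-pooling branch, concatenated along the channel dimension, then
    regressed to a single-channel first density map."""
    def __init__(self, in_channels=128):
        super().__init__()
        self.b1 = nn.Conv2d(in_channels, 32, 1)
        self.b3 = nn.Sequential(nn.Conv2d(in_channels, 48, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(48, 64, 3, padding=1))    # 1x1 reduces dimension before the 3x3
        self.b5 = nn.Sequential(nn.Conv2d(in_channels, 16, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(16, 32, 5, padding=2))    # 1x1 reduces dimension before the 5x5
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_channels, 32, 1))      # pooling branch
        self.out = nn.Conv2d(32 + 64 + 32 + 32, 1, 1)               # fuse branches and regress density

    def forward(self, x):
        y = torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)
        return self.out(torch.relu(y))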
For a deformable convolutional layer, the layer is obtained by adding offsets to a normal convolutional layer, and the added offsets are part of the network structure. With the offsets, the size and position of the deformable convolution kernel can be adjusted dynamically according to the image content currently being recognized or classified. Visually, the sampling points of the convolution kernels at different positions shift adaptively with the image content, adapting to geometric deformations such as differences in the shapes and sizes of objects, and thus realizing adaptive adjustment of the receptive field.
The sampling manner of a normal convolution and a deformable convolution with a 3 × 3 kernel is explained below with fig. 10. As shown in fig. 10, (a) in fig. 10 is a normal convolutional layer, while (b) and (c) in fig. 10 are deformable convolutional layers. The normal convolution regularly samples 9 points, whereas (b) and (c) add a shift (shown by arrows, also called an offset) to the regular sampling coordinates.
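A minimal sketch of a deformable prediction head follows, assuming torchvision's DeformConv2d operator is available; the offset-predicting 3x3 convolution, the 64-channel width, and the 1x1 regression layer are illustrative assumptions.

import torch.nn as nn
from torchvision.ops import DeformConv2d   # assumes torchvision provides deformable convolution

class DeformableHead(nn.Module):
    """Deformable prediction-head sketch: a plain 3x3 convolution predicts per-position
    sampling offsets (2 values per kernel point), and the deformable convolution samples
    the feature map at the shifted locations before a 1x1 density regression."""
    def __init__(self, in_channels=128):
        super().__init__()
        self.offset = nn.Conv2d(in_channels, 2 * 3 * 3, 3, padding=1)   # offsets for a 3x3 kernel
        self.dconv = DeformConv2d(in_channels, 64, 3, padding=1)
        self.out = nn.Conv2d(64, 1, 1)

    def forward(self, x):
        off = self.offset(x)
        y = self.dconv(x, off).relu()
        return self.out(y)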
For a residual structure: in mathematical statistics, a residual is the difference between an actual observed value and an estimated (fitted) value. A residual structure obtains, for a certain variable, the actual observed value and the estimated value through a skip connection or identity mapping, and thereby obtains the residual. For example, the optimization target of a network may be H(x) = F(x) + x; with a residual structure, the stacked layers learn F(x) = H(x) - x instead of H(x) directly. The layers then no longer need to fit an identity mapping but only a residual close to 0, which effectively reduces the training difficulty. For example, as shown in fig. 11, a branch network may have a residual structure; by introducing the residual through this structure, a first density map corresponding to a receptive field within a certain range can be obtained.
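As a minimal sketch in the same assumed PyTorch setting, a residual prediction head could look like the following; the two-layer body and the 1x1 regression layer are assumptions for the example.

import torch.nn as nn

class ResidualHead(nn.Module):
    """Residual prediction-head sketch: two 3x3 convolutions whose output is added
    back to the input through an identity skip, so the stack only learns the residual
    F(x) = H(x) - x; a final 1x1 layer regresses the first density map."""
    def __init__(self, in_channels=128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, in_channels, 3, padding=1),
        )
        self.out = nn.Conv2d(in_channels, 1, 1)

    def forward(self, x):
        y = (x + self.body(x)).relu()   # identity mapping plus learned residual
        return self.out(y)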
505. Based on the integration module of the image processing model, the electronic device acquires the matching degrees between different image areas in the image to be processed and the at least two receptive fields according to the depth feature.
After the electronic device acquires the first density maps corresponding to the different receptive fields, it can use the integration module to analyze the image content and dynamically integrate the first density maps based on the analysis result. During integration, the image content can be analyzed through the depth feature to determine the matching degree between the distance condition of different image areas (which may be relatively small areas) and the different receptive fields. Specifically, step 505 can be implemented through the following step one and step two.
Step one, the electronic equipment acquires intermediate features of different image areas in the image to be processed according to the depth features.
The electronic device may process the depth features to convert them into intermediate features corresponding to the number of receptive fields. Specifically, the electronic device may perform average pooling on the depth features to obtain the depth features of the different image areas, and perform convolution on the depth features of the different image areas to obtain intermediate features of the different image areas.
For example, as shown in fig. 12, the electronic device may turn the feature map into k × k cells through an adaptive average pooling module, and then perform feature transformation on each cell with a 1 × 1 convolutional layer, where each cell predicts n values corresponding to the n prediction heads.
And step two, the electronic equipment performs normalization processing on the intermediate features of the different image areas to obtain the matching degree between the different image areas and the at least two receptive fields in the image to be processed.
In some embodiments, the matching degree may take the form of a probability distribution. Specifically, the electronic device may normalize the intermediate features of the different image regions to obtain a probability distribution for each image region in the image to be processed. The probability distribution of an image region represents the matching degrees between that region and the at least two receptive fields, and each probability value in the distribution represents the matching degree between the region and one target receptive field.
For example, as shown in fig. 12, after obtaining the intermediate features, the electronic device may use softmax to obtain the probability distribution over the n prediction heads. As shown in fig. 13, with each branch serving as a prediction head, the feature extraction network first extracts the depth features, and the n prediction heads then perform prediction to obtain the corresponding first density maps.
The above is merely an exemplary illustration; the electronic device may also adopt a more complex or more effective structure, for example adding a multi-branch context modeling module, such as an Inception structure, between the average pooling and the 1 × 1 convolution. The embodiments of the present application do not limit this.
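A minimal sketch of this matching-degree branch of the integration module, under the same PyTorch assumptions, is given below; the grid size k and the number of prediction heads n are free hyper-parameters here, not values fixed by the embodiment.

import torch.nn as nn
import torch.nn.functional as F

class MatchingModule(nn.Module):
    """Matching-degree sketch: adaptive average pooling turns the depth feature into a
    k x k grid of cells, a 1x1 convolution predicts n values per cell (one per prediction
    head), and softmax normalizes them into a probability distribution over the n
    receptive fields."""
    def __init__(self, in_channels=128, n_heads=3, k=8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(k)              # k x k image regions
        self.proj = nn.Conv2d(in_channels, n_heads, 1)   # intermediate features: n values per cell

    def forward(self, feat):
        w = self.proj(self.pool(feat))                   # shape (N, n, k, k)
        return F.softmax(w, dim=1)                       # matching degree per region and per head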
506. Based on the integration module of the image processing model, the electronic device weights the crowd densities of different image areas in the at least two first density maps by using the matching degrees corresponding to the at least two receptive fields as weights, to obtain a second density map of the image to be processed.
By determining the matching degrees, the electronic device analyzes which size of receptive field is better suited to each image area in the image to be processed. People in different image areas may be at different distances and their head sizes differ, so the suitable receptive fields differ as well. Using the matching degrees as weights, a more suitable receptive field can be determined for each image area, and the pixel values of each image area in the weighted second density map are closer to the pixel values in the first density map obtained with that more suitable receptive field.
The second density map is obtained at the granularity of image areas. Compared with the related-art approach of directly estimating the crowd density of the whole image based on a fixed receptive field, the density estimation accuracy is greatly improved, the accuracy of the second density map is improved, and the number of people determined in the following steps is therefore more accurate.
The above description has been given by taking an example that the integration module adaptively integrates the first density maps corresponding to the at least two receptive fields according to the matching degree to obtain the second density map, and the obtaining process of the second density map may also be implemented in other manners, for example, the electronic device may select a first density map region of one receptive field for each image region according to the matching degree, and then combine the first density map regions corresponding to each image region to obtain the second density map, which is not limited in the embodiment of the present application.
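As an illustration of the weighting in step 506, here is a minimal sketch under the assumption that the n first density maps are stacked into one tensor and the matching degrees come from the k x k grid above; the nearest-neighbor upsampling of the weights is an assumption made for the example.

import torch.nn.functional as F

def fuse_density_maps(first_maps, match):
    """Weighting sketch: `first_maps` has shape (N, n, H, W) stacking the n first density
    maps, `match` has shape (N, n, k, k). The matching degrees are upsampled to the
    density-map resolution and used as per-pixel weights."""
    weights = F.interpolate(match, size=first_maps.shape[-2:], mode='nearest')
    second_map = (first_maps * weights).sum(dim=1, keepdim=True)   # (N, 1, H, W) second density map
    return second_map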
507. Based on the integration module of the image processing model, the electronic device acquires the number of people contained in the image to be processed according to the second density map.
After the electronic device obtains the second density map, the second density map can accurately reflect the crowd density of each pixel point in the image to be processed, namely the average number of people of each pixel point, and the number of people contained in the image to be processed, namely the total number of people in the image to be processed, can be obtained based on the second density map.
In some embodiments, the electronic device may sum the density values in the second density map to obtain the number of people contained in the image to be processed. In step 507, the electronic device sums the density values in the second density map to obtain the sum of all density values. Because each density value represents the crowd density of a pixel point, that is, the average number of people at that pixel point, summing over all pixel points yields the number of people contained in the image to be processed.
In other embodiments, the summation of the density values in the second density map by the electronic device can be performed by integral calculation, that is, the density values in the second density map can be integrated along the abscissa or the ordinate.
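A one-line sketch of the summation, assuming the (N, 1, H, W) second density map from the previous sketch:

def count_people(second_map):
    """Summing the per-pixel density values gives the estimated headcount, one per image."""
    return second_map.sum(dim=(1, 2, 3))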
It should be noted that, in the above steps 506 and 507, the matching degrees corresponding to the at least two receptive fields are used as weights, and the crowd densities of different image areas in the at least two first density maps are integrated to obtain the number of people included in the image to be processed, and the above processes are only described by taking as an example a process of obtaining the total number of people by weighting the first density maps through the matching degrees to obtain a second density map and then obtaining the total number of people based on the second density map. In some embodiments, the process may also be implemented in other manners, for example, the electronic device may select a first density map region of a receptive field for each image region according to the matching degree, and then combine regions in the first density map corresponding to each image region to obtain a second density map, so as to obtain the total number of people. For another example, the electronic device may directly perform weighted summation on the density values in the first density map according to the matching degree to obtain the total number of people, and the embodiment of the present application does not limit what specific method is used.
In a possible implementation, after step 506 is performed and the electronic device obtains the second density map based on the image processing model, the electronic device may also output and display the second density map, or display each image region of the image to be processed in a display style corresponding to the crowd density of that region in the second density map. That is, the display style of each image area in the image to be processed corresponds to its density value in the second density map.
For example, as shown in fig. 14, by applying the image processing method provided by the present application, image (a) in fig. 14 can be processed to determine the total number of people in the image, for example 208, and the crowd density can be displayed according to the determined second density map, as shown in image (b) of fig. 14. The headcount of 208 is shown in (b); the crowd density may differ between image areas and may be displayed in different colors, which is illustrated in (b) with different patterns instead of colors.
The foregoing steps 502 to 507 have been described only by taking an example of processing an image to be processed by an image processing model, and in some embodiments, the image processing method may not be implemented by an image processing model. In this embodiment, after acquiring the image to be processed, the electronic device may directly perform the image processing steps similar to those in steps 502 to 507, but without the need of an image processing model, and the embodiment of the present application does not limit what kind of method is specifically adopted.
In the embodiments of the present application, it is considered that different image areas of an image may be at different distances and the head sizes in them may therefore differ. For each image area of the image, after the image features are extracted, the crowd density is acquired under different receptive fields; the different image areas are then analyzed to determine their distance, and a suitable receptive field is dynamically selected to adjust the crowd density of each area. The resulting crowd density better matches the distance of each image area, which improves the accuracy of crowd density acquisition and, in turn, the accuracy of the estimated number of people.
All the above optional technical solutions can be combined arbitrarily to form optional embodiments of the present application, and are not described herein again.
Fig. 15 is a schematic structural diagram of an image processing apparatus provided in an embodiment of the present application, and referring to fig. 15, the apparatus includes:
an image acquisition module 1501, configured to acquire an image to be processed;
a feature extraction module 1502, configured to perform feature extraction on the image to be processed to obtain a depth feature of the image to be processed;
a density obtaining module 1503, configured to obtain, based on at least two receptive fields and the depth feature, population densities of different image regions in the image to be processed to obtain at least two first density maps corresponding to the at least two receptive fields, where the at least two receptive fields are different from each other;
a matching degree obtaining module 1504, configured to obtain matching degrees between different image areas in the image to be processed and the at least two receptive fields according to the depth feature;
the number obtaining module 1505 is configured to use the matching degrees corresponding to the at least two receptive fields as weights, and synthesize the crowd densities of different image regions in the at least two first density maps to obtain the number of people included in the image to be processed.
In a possible implementation manner, the density obtaining module 1503 is configured to perform convolution processing on the depth feature based on at least two of convolutional layers with different void rates, a deformable convolutional layer, an Inception structure, or a residual structure, respectively, to obtain at least two first density maps corresponding to at least two receptive fields.
In one possible implementation, the feature extraction module 1502 is configured to perform convolution processing on the image to be processed based on at least one convolution layer in succession, and output of a last convolution layer is taken as the depth feature.
In one possible implementation, the number of the at least one convolutional layer is at least two; the at least one convolutional layer employs a hopping link; the two convolutional layers of the jump link comprise a first convolutional layer and a second convolutional layer, and the first convolutional layer is used for downsampling the depth characteristics output by the previous first convolutional layer; the second convolutional layer is used for up-sampling the depth characteristics output by the previous second convolutional layer and the depth characteristics output by the connected first convolutional layer.
In one possible implementation, the matching degree obtaining module 1504 includes a feature obtaining unit and a normalizing unit;
the feature acquisition unit is used for acquiring intermediate features of different image areas in the image to be processed according to the depth features;
the normalization unit is used for performing normalization processing on the intermediate features of the different image areas to obtain the matching degree between the different image areas and the at least two receptive fields in the image to be processed.
In a possible implementation manner, the normalization unit is configured to perform normalization processing on the intermediate features of the different image regions to obtain probability distributions of the different image regions in the image to be processed, a probability distribution of one image region is used to represent a matching degree between the image region and the at least two receptive fields, and a probability value in the probability distribution is used to represent a matching degree between the image region and a target receptive field.
In one possible implementation, the feature obtaining unit is configured to:
carrying out average pooling on the depth features to obtain the depth features of different image areas;
and performing convolution processing on the depth features of the different image areas respectively to obtain the intermediate features of the different image areas.
In one possible implementation, the number obtaining module 1505 includes a weighting unit and a number obtaining unit;
the weighting unit is used for weighting the crowd densities of different image areas in the at least two first density maps by taking the matching degrees corresponding to the at least two receptive fields as weights to obtain a second density map of the image to be processed;
the number acquiring unit is used for acquiring the number of the persons contained in the image to be processed according to the second density map.
In a possible implementation manner, the number obtaining unit is configured to perform summation processing on the density values in the second density map to obtain the number of people included in the image to be processed.
In one possible implementation, the steps of feature extraction, crowd density acquisition, matching degree acquisition, and number of people acquisition are performed by an image processing model;
the image processing model comprises a feature extraction network, at least two branch networks and an integration module; the feature extraction network is used for extracting features of the image to be processed to obtain depth features of the image to be processed; the at least two branch networks are used for executing the crowd density obtaining step based on at least two different receptive fields; the integration module is used for executing the matching degree obtaining step and the number-of-people obtaining step.
In one possible implementation, the image processing model is obtained based on a sample image and a target density map corresponding to the sample image; the target density map is obtained based on the following process:
acquiring a sample image and positions of at least two human heads in the sample image;
generating at least two first response graphs according to the positions of the at least two heads, wherein the pixel values of the positions of the at least two heads in the first response graphs are 1, and the pixel values of other positions are 0;
summing the at least two first response graphs to obtain a second response graph;
and performing Gaussian convolution processing on the second response image to obtain the target density image.
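For illustration, the target-density-map construction described above can be sketched as follows; the use of scipy's gaussian_filter, the kernel bandwidth sigma, and the annotation format (one (x, y) pixel coordinate per head) are assumptions made for this example.

import numpy as np
from scipy.ndimage import gaussian_filter   # Gaussian convolution of the summed response map

def make_target_density(head_positions, height, width, sigma=4.0):
    """Place 1 at every annotated head position (the summed response map) and smooth
    with a Gaussian kernel; the sum of the map stays approximately equal to the number
    of annotated heads."""
    response = np.zeros((height, width), dtype=np.float32)
    for x, y in head_positions:            # (column, row) pixel coordinates per head
        response[int(y), int(x)] += 1.0    # summing the per-head first response maps
    return gaussian_filter(response, sigma=sigma)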
In one possible implementation, the training process of the image processing model includes:
obtaining sample images, wherein each sample image corresponds to a target density map;
inputting a sample image into the image processing model, and performing feature extraction on the sample image by the image processing model to obtain a sample depth feature of the sample image; acquiring the crowd density of different image areas in the sample image based on at least two different receptive fields and the sample depth feature, to obtain at least two sample first density maps corresponding to the at least two receptive fields; according to the sample depth feature, obtaining the matching degrees between different image areas in the sample image and the at least two receptive fields; and weighting the crowd densities of different image areas in the at least two sample first density maps by using the matching degrees corresponding to the at least two receptive fields as weights, to obtain a sample second density map;
respectively obtaining a prediction loss value and a comprehensive loss value corresponding to the at least two different receptive fields based on the target density map, the first sample density map and the second sample density map of the sample image;
and updating the model parameters of the image processing model based on the prediction loss value and the comprehensive loss value until the model parameters meet the target conditions, so as to obtain the trained image processing model.
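A minimal sketch of one training step under this description follows; the pixel-wise mean-squared-error form of the prediction loss and the comprehensive loss, and the assumption that the model returns the stacked sample first density maps together with the sample second density map, are illustrative choices not fixed by the embodiment.

import torch.nn.functional as F

def training_step(model, image, target_density, optimizer):
    """One training iteration sketch: each prediction head contributes a prediction loss
    against the target density map, the fused sample second density map contributes the
    comprehensive loss, and their sum drives the parameter update."""
    first_maps, second_map = model(image)                          # (N, n, H, W) and (N, 1, H, W)
    target = target_density.unsqueeze(1)                           # (N, 1, H, W) target density map
    prediction_loss = sum(F.mse_loss(first_maps[:, i:i + 1], target)
                          for i in range(first_maps.shape[1]))
    comprehensive_loss = F.mse_loss(second_map, target)
    loss = prediction_loss + comprehensive_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()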
With the apparatus provided by the embodiments of the present application, it is considered that different image areas of an image may be at different distances and the head sizes in them may therefore differ. For each image area of the image, after the image features are extracted, the crowd density is acquired under different receptive fields; the different image areas are then analyzed to determine their distance, and a suitable receptive field is dynamically selected to adjust the crowd density of each area. The resulting crowd density better matches the distance of each image area, which improves the accuracy of crowd density acquisition and, in turn, the accuracy of the estimated number of people.
It should be noted that: in the image processing apparatus provided in the above embodiment, when processing an image, only the division of the above functional modules is taken as an example, and in practical applications, the above function allocation can be completed by different functional modules according to needs, that is, the internal structure of the image processing apparatus is divided into different functional modules so as to complete all or part of the above described functions. In addition, the image processing apparatus and the image processing method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments in detail and are not described herein again.
The electronic device in the above method embodiments can be implemented as a terminal. For example, fig. 16 is a block diagram of a terminal according to an embodiment of the present disclosure. The terminal 1600 may be a portable mobile terminal such as: a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. Terminal 1600 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, desktop terminal, etc.
Generally, terminal 1600 includes: a processor 1601, and a memory 1603.
Processor 1601 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 1601 may be implemented in at least one hardware form of a DSP (digital signal processor), an FPGA (field-programmable gate array), or a PLA (programmable logic array). The processor 1601 may also include a main processor and a coprocessor. The main processor is a processor for processing data in an awake state, also called a central processing unit (CPU); a coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 1601 may be integrated with a GPU (graphics processing unit), which is responsible for rendering and drawing the content that the display screen needs to display. In some embodiments, the processor 1601 may further include an AI (artificial intelligence) processor for processing computing operations related to machine learning.
Memory 1603 may include one or more computer-readable storage media, which may be non-transitory. Memory 1603 may also include high speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 1603 is used to store at least one instruction for execution by processor 1601 to implement the image processing method provided by the method embodiments herein.
In some embodiments, the terminal 1600 may also optionally include: peripheral interface 1603 and at least one peripheral. The processor 1601, the memory 1603 and the peripheral interface 1603 may be connected by buses or signal lines. Various peripheral devices may be connected to peripheral interface 1603 via buses, signal lines, or circuit boards. Specifically, the peripheral device includes: at least one of a radio frequency circuit 1604, a display 1605, a camera assembly 1606, audio circuitry 1607, a positioning assembly 1608, and a power supply 1609.
Peripheral interface 1603 may be used to connect at least one I/O (Input/Output) related peripheral to processor 1601 and memory 1603. In some embodiments, processor 1601, memory 1603, and peripheral interface 1603 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1601, the memory 1603 and the peripheral interface 1603 may be implemented on a separate chip or circuit board, which is not limited by this embodiment.
The radio frequency circuit 1604 is used for receiving and transmitting RF (radio frequency) signals, also called electromagnetic signals. The radio frequency circuitry 1604 communicates with communication networks and other communication devices via electromagnetic signals. The rf circuit 1604 converts the electrical signal into an electromagnetic signal to be transmitted, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 1604 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 1604 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: the world wide web, metropolitan area networks, intranets, various generations of mobile communication networks (3G, 4G, and 5G), wireless local area networks, and/or WiFi (wireless fidelity) networks. In some embodiments, the rf circuit 1604 may further include NFC (near field communication) related circuits, which are not limited in this application.
The display screen 1605 is used to display a UI (user interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 1605 is a touch display screen, the display screen 1605 also has the ability to capture touch signals on or over the surface of the display screen 1605. The touch signal may be input to the processor 1601 as a control signal for processing. At this point, the display 1605 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, the display 1605 can be one, disposed on the front panel of the terminal 1600; in other embodiments, the display screens 1605 can be at least two, respectively disposed on different surfaces of the terminal 1600 or in a folded design; in other embodiments, display 1605 can be a flexible display disposed on a curved surface or a folded surface of terminal 1600. Even further, the display 1605 may be arranged in a non-rectangular irregular pattern, i.e., a shaped screen. The display 1605 may be made of LCD (liquid crystal display), OLED (organic light-emitting diode), or the like.
The camera assembly 1606 is used to capture images or video. Optionally, camera assembly 1606 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of the terminal, and a rear camera is disposed at a rear surface of the terminal. In some embodiments, the number of the rear cameras is at least two, and each of the rear cameras is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (virtual reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 1606 can also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
The audio circuitry 1607 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 1601 for processing or inputting the electric signals to the radio frequency circuit 1604 to achieve voice communication. For stereo sound acquisition or noise reduction purposes, the microphones may be multiple and disposed at different locations of terminal 1600. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 1601 or the radio frequency circuit 1604 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, the audio circuit 1607 may also include a headphone jack.
The positioning component 1608 is configured to locate the current geographic location of the terminal 1600 for navigation or LBS (location based service). The positioning component 1608 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, or the Galileo system of the European Union.
Power supply 1609 is used to provide power to the various components of terminal 1600. Power supply 1609 may be alternating current, direct current, disposable or rechargeable. When power supply 1609 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. The wired rechargeable battery is a battery charged through a wired line, and the wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, terminal 1600 also includes one or more sensors 1610. The one or more sensors 1610 include, but are not limited to: an acceleration sensor 1611, a gyro sensor 1612, a pressure sensor 1613, a fingerprint sensor 1614, an optical sensor 1615, and a proximity sensor 1616.
Acceleration sensor 1611 may detect acceleration in three coordinate axes of a coordinate system established with terminal 1600. For example, the acceleration sensor 1611 may be used to detect components of the gravitational acceleration in three coordinate axes. The processor 1601 may control the display screen 1605 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 1611. The acceleration sensor 1611 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 1612 may detect the body direction and rotation angle of the terminal 1600, and the gyro sensor 1612 and the acceleration sensor 1611 may cooperate to collect 3D motions of the user on the terminal 1600. The processor 1601 may perform the following functions according to the data collected by the gyro sensor 1612: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization during photographing, game control, and inertial navigation.
Pressure sensors 1613 may be disposed on the side frames of terminal 1600 and/or underlying display 1605. When the pressure sensor 1613 is disposed on the side frame of the terminal 1600, a user's holding signal of the terminal 1600 can be detected, and the processor 1601 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 1613. When the pressure sensor 1613 is disposed at the lower layer of the display 1605, the processor 1601 controls the operability control on the UI interface according to the pressure operation of the user on the display 1605. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 1614 is configured to collect a fingerprint of the user, and the processor 1601 is configured to identify the user based on the fingerprint collected by the fingerprint sensor 1614, or the fingerprint sensor 1614 is configured to identify the user based on the collected fingerprint. Upon recognizing that the user's identity is a trusted identity, the processor 1601 authorizes the user to perform relevant sensitive operations including unlocking a screen, viewing encrypted information, downloading software, paying for and changing settings, etc. The fingerprint sensor 1614 may be disposed on the front, back, or side of the terminal 1600. When a physical key or vendor Logo is provided on the terminal 1600, the fingerprint sensor 1614 may be integrated with the physical key or vendor Logo.
The optical sensor 1615 is used to collect ambient light intensity. In one embodiment, the processor 1601 may control the display brightness of the display screen 1605 based on the ambient light intensity collected by the optical sensor 1615. Specifically, when the ambient light intensity is high, the display luminance of the display screen 1605 is increased; when the ambient light intensity is low, the display brightness of the display screen 1605 is adjusted down. In another embodiment, the processor 1601 may also dynamically adjust the shooting parameters of the camera assembly 1606 based on the ambient light intensity collected by the optical sensor 1615.
A proximity sensor 1616, also referred to as a distance sensor, is typically disposed on the front panel of terminal 1600. The proximity sensor 1616 is used to collect the distance between the user and the front surface of the terminal 1600. In one embodiment, when the proximity sensor 1616 detects that the distance between the user and the front surface of the terminal 1600 gradually decreases, the processor 1601 controls the display 1605 to switch from the bright-screen state to the off-screen state; when the proximity sensor 1616 detects that the distance between the user and the front surface of the terminal 1600 gradually increases, the processor 1601 controls the display 1605 to switch from the off-screen state to the bright-screen state.
Those skilled in the art will appreciate that the configuration shown in fig. 16 is not intended to be limiting of terminal 1600, and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be employed.
The electronic device in the above method embodiments can also be implemented as a server. For example, fig. 17 is a schematic structural diagram of a server according to an embodiment of the present application. The server 1700 may vary considerably in configuration or performance, and can include one or more processors (CPUs) 1701 and one or more memories 1703, where the memory 1703 stores at least one piece of program code, and the at least one piece of program code is loaded and executed by the processor 1701 to implement the image processing method provided by each of the method embodiments. Certainly, the server can also have components such as a wired or wireless network interface and an input/output interface to facilitate input and output, and the server can also include other components for implementing device functions, which are not described herein again.
In an exemplary embodiment, there is also provided a computer-readable storage medium, such as a memory, including at least one program code, the at least one program code being executable by a processor to perform the image processing method in the above-described embodiments. For example, the computer-readable storage medium can be a Read-only memory (ROM), a Random Access Memory (RAM), a compact disc Read-only memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product or a computer program is also provided, which comprises one or more program codes, which are stored in a computer-readable storage medium. The one or more processors of the electronic device can read the one or more program codes from the computer-readable storage medium, and the one or more processors execute the one or more program codes, so that the electronic device can perform the image processing method described above.
It should be understood that, in the various embodiments of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
It should be understood that determining B from a does not mean determining B from a alone, but can also determine B from a and/or other information.
Those skilled in the art will appreciate that all or part of the steps for implementing the above embodiments can be implemented by hardware, or can be implemented by a program for instructing relevant hardware, and the program can be stored in a computer readable storage medium, and the above mentioned storage medium can be read only memory, magnetic or optical disk, etc.
The above description is intended only to be an alternative embodiment of the present application, and not to limit the present application, and any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (15)

1. An image processing method, characterized in that the method comprises:
acquiring an image to be processed;
performing feature extraction on the image to be processed to obtain the depth feature of the image to be processed;
acquiring the crowd density of different image areas in the image to be processed based on at least two receptive fields and the depth characteristics to obtain at least two first density maps corresponding to the at least two receptive fields, wherein the at least two receptive fields are different from each other;
according to the depth features, obtaining matching degrees between different image areas in the image to be processed and the at least two receptive fields;
and taking the matching degrees corresponding to the at least two receptive fields as weights, and integrating the crowd densities of different image areas in the at least two first density maps to obtain the number of people contained in the image to be processed.
2. The method according to claim 1, wherein the obtaining of the population density of different image regions in the image to be processed based on the at least two receptive fields and the depth feature to obtain at least two first density maps corresponding to the at least two receptive fields comprises:
and performing convolution processing on the depth features respectively based on at least two items of convolution layers with different void rates, a deformable convolution layer, an Inception structure or a residual structure to obtain at least two first density maps corresponding to at least two receptive fields.
3. The method according to claim 1, wherein the performing feature extraction on the image to be processed to obtain the depth feature of the image to be processed comprises:
and performing convolution processing on the image to be processed based on at least one continuous convolution layer, and taking the output of the last convolution layer as the depth feature.
4. The method of claim 3, wherein the number of the at least one convolutional layer is at least two; the at least one convolutional layer adopts a jump link; the two convolutional layers of the jump link comprise a first convolutional layer and a second convolutional layer, and the first convolutional layer is used for downsampling the depth characteristics output by the previous first convolutional layer; the second convolutional layer is used for up-sampling the depth characteristics output by the previous second convolutional layer and the depth characteristics output by the connected first convolutional layer.
5. The method according to claim 1, wherein the obtaining the matching degree between different image areas in the image to be processed and the at least two receptive fields according to the depth features comprises:
acquiring intermediate features of different image areas in the image to be processed according to the depth features;
and normalizing the intermediate features of the different image areas to obtain the matching degree between the different image areas and the at least two receptive fields in the image to be processed.
6. The method according to claim 5, wherein the normalizing the intermediate features of the different image regions to obtain the matching degrees between the different image regions and the at least two receptive fields in the image to be processed comprises:
and normalizing the intermediate features of the different image regions to obtain probability distributions of the different image regions in the image to be processed, wherein the probability distribution of one image region is used for expressing the matching degree between the image region and the at least two receptive fields, and one probability value in the probability distribution is used for expressing the matching degree between the image region and a target receptive field.
7. The method according to claim 5, wherein the obtaining intermediate features of different image areas in the image to be processed according to the depth features comprises:
carrying out average pooling on the depth features to obtain the depth features of the different image areas;
and performing convolution processing on the depth features of the different image areas respectively to obtain the intermediate features of the different image areas.
8. The method according to claim 1, wherein the obtaining the number of people included in the image to be processed by using the matching degrees corresponding to the at least two receptive fields as weights and integrating the crowd densities of different image areas in the at least two first density maps comprises:
weighting the crowd densities of different image areas in the at least two first density maps by taking the matching degrees corresponding to the at least two receptive fields as weights to obtain a second density map of the image to be processed;
and acquiring the number of people contained in the image to be processed according to the second density map.
9. The method according to claim 8, wherein the obtaining the number of persons included in the image to be processed according to the second density map comprises:
and summing the density values in the second density map to obtain the number of people contained in the image to be processed.
10. The method of claim 1, wherein the steps of feature extraction, crowd density acquisition, matching degree acquisition, and number of people acquisition are performed by an image processing model;
the image processing model comprises a feature extraction network, at least two branch networks and a comprehensive module; the feature extraction network is used for extracting features of the image to be processed to obtain depth features of the image to be processed; the at least two branch networks are used for executing a crowd density obtaining step based on at least two different receptive fields; the integration module is used for executing the matching degree obtaining step and the number obtaining step.
11. The method of claim 10, wherein the image processing model is trained based on the sample image and a target density map corresponding to the sample image; the target density map is obtained based on the following process:
obtaining a sample image and positions of at least two human heads in the sample image;
generating at least two first response graphs according to the positions of the at least two heads, wherein the pixel values of the positions of the at least two heads in the first response graphs are 1, and the pixel values of other positions are 0;
summing the at least two first response graphs to obtain a second response graph;
and performing Gaussian convolution processing on the second response image to obtain the target density image.
12. The method of claim 10, wherein the training process of the image processing model comprises:
obtaining sample images, wherein each sample image corresponds to a target density map;
inputting a sample image into an image processing model, and performing feature extraction on the sample image by the image processing model to obtain a sample depth feature of the sample image; acquiring the crowd density of different image areas in the image to be processed based on at least two different receptive fields and the sample depth characteristics to obtain at least two sample first density maps corresponding to the at least two receptive fields; according to the sample depth features, obtaining matching degrees between different image areas in the sample image and the at least two receptive fields; weighting the crowd densities of different image areas in the at least two sample first density graphs by taking the matching degrees corresponding to the at least two receptive fields as weights to obtain a sample second density graph;
respectively acquiring a prediction loss value and a comprehensive loss value corresponding to the at least two different receptive fields based on the target density map, the first sample density map and the second sample density map of the sample image;
and updating the model parameters of the image processing model based on the prediction loss value and the comprehensive loss value until the model parameters meet the target condition, so as to obtain the trained image processing model.
13. An image processing apparatus, characterized in that the apparatus comprises:
the image acquisition module is used for acquiring an image to be processed;
the feature extraction module is used for extracting features of the image to be processed to obtain depth features of the image to be processed;
the density acquisition module is used for acquiring, based on at least two receptive fields and the depth features, the crowd densities of different image areas in the image to be processed to obtain at least two first density maps corresponding to the at least two receptive fields, the at least two receptive fields being different from each other;
the matching degree acquisition module is used for acquiring the matching degree between different image areas in the image to be processed and the at least two receptive fields according to the depth characteristics;
and the quantity acquisition module is used for integrating the crowd densities of different image areas in the at least two first density maps by taking the matching degrees corresponding to the at least two receptive fields as weights to obtain the number of people contained in the image to be processed.
14. An electronic device, comprising one or more processors and one or more memories having at least one program code stored therein, the at least one program code being loaded and executed by the one or more processors to implement the image processing method of any one of claims 1 to 12.
15. A computer-readable storage medium, characterized in that at least one program code is stored in the storage medium, which is loaded and executed by a processor to implement the image processing method according to any one of claims 1 to 12.
CN202011020045.7A 2020-09-24 2020-09-24 Image processing method, device, equipment and storage medium Active CN112115900B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011020045.7A CN112115900B (en) 2020-09-24 2020-09-24 Image processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011020045.7A CN112115900B (en) 2020-09-24 2020-09-24 Image processing method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112115900A (en) 2020-12-22
CN112115900B (en) 2024-04-30

Family

ID=73801617

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011020045.7A Active CN112115900B (en) 2020-09-24 2020-09-24 Image processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112115900B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018187632A1 (en) * 2017-04-05 2018-10-11 Carnegie Mellon University Deep learning methods for estimating density and/or flow of objects, and related methods and software
CN108596054A (en) * 2018-04-10 2018-09-28 上海工程技术大学 Crowd counting method based on multi-scale fully convolutional network feature fusion
CN109697435A (en) * 2018-12-14 2019-04-30 重庆中科云从科技有限公司 People flow monitoring method, device, storage medium and equipment
CN109858424A (en) * 2019-01-25 2019-06-07 佳都新太科技股份有限公司 Crowd density statistical method, device, electronic equipment and storage medium
US10453197B1 (en) * 2019-02-18 2019-10-22 Inception Institute of Artificial Intelligence, Ltd. Object counting and instance segmentation using neural network architectures with image-level supervision
CN110188685A (en) * 2019-05-30 2019-08-30 燕山大学 Object counting method and system based on a dual-attention multi-scale cascade network
CN110852267A (en) * 2019-11-11 2020-02-28 复旦大学 Crowd density estimation method and device based on an optical-flow-fusion deep neural network
CN110956122A (en) * 2019-11-27 2020-04-03 深圳市商汤科技有限公司 Image processing method and device, processor, electronic device and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JIANG LIU et al.: "DecideNet: Counting Varying Density Crowds Through Attention Guided Detection and Density Estimation", Computer Vision and Pattern Recognition (CVPR), pages 5197-5206 *
JIWEI CHEN et al.: "Crowd counting with crowd attention convolutional neural network", Neurocomputing, vol. 382, no. 21, pages 210-220, XP086013805, DOI: 10.1016/j.neucom.2019.11.064 *
YU YANG (虞扬): "Research on crowd counting and density estimation based on computer vision", China Masters' Theses Full-text Database (中国优秀硕士学位论文数据库), no. 7, pages 138-1089 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112712078A (en) * 2020-12-31 2021-04-27 上海智臻智能网络科技股份有限公司 Text detection method and device
CN112766340A (en) * 2021-01-11 2021-05-07 中山大学 Depth capsule network image classification method and system based on adaptive spatial mode
CN112766340B (en) * 2021-01-11 2024-06-04 中山大学 Depth capsule network image classification method and system based on self-adaptive spatial mode
CN113936071A (en) * 2021-10-18 2022-01-14 清华大学 Image processing method and device

Also Published As

Publication number Publication date
CN112115900B (en) 2024-04-30

Similar Documents

Publication Publication Date Title
CN110136136B (en) Scene segmentation method and device, computer equipment and storage medium
CN110348543B (en) Fundus image recognition method and device, computer equipment and storage medium
CN111476306B (en) Object detection method, device, equipment and storage medium based on artificial intelligence
CN111091576B (en) Image segmentation method, device, equipment and storage medium
CN111489378B (en) Video frame feature extraction method and device, computer equipment and storage medium
WO2020224479A1 (en) Method and apparatus for acquiring positions of target, and computer device and storage medium
CN111325726A (en) Model training method, image processing method, device, equipment and storage medium
CN111325258A (en) Characteristic information acquisition method, device, equipment and storage medium
CN112115900B (en) Image processing method, device, equipment and storage medium
CN111931877B (en) Target detection method, device, equipment and storage medium
CN111243668B (en) Method and device for detecting molecule binding site, electronic device and storage medium
CN111932463B (en) Image processing method, device, equipment and storage medium
CN112749613B (en) Video data processing method, device, computer equipment and storage medium
CN112990053B (en) Image processing method, device, equipment and storage medium
CN111368116B (en) Image classification method and device, computer equipment and storage medium
CN112036331A (en) Training method, device and equipment of living body detection model and storage medium
CN110570460A (en) Target tracking method and device, computer equipment and computer readable storage medium
CN114283050A (en) Image processing method, device, equipment and storage medium
CN114820633A (en) Semantic segmentation method, training device and training equipment of semantic segmentation model
CN113205515B (en) Target detection method, device and computer storage medium
CN111178343A (en) Multimedia resource detection method, device, equipment and medium based on artificial intelligence
CN112381707A (en) Image generation method, device, equipment and storage medium
CN113705302A (en) Training method and device for image generation model, computer equipment and storage medium
CN112037305B (en) Method, device and storage medium for reconstructing tree-like organization in image
CN113570510A (en) Image processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant