Disclosure of Invention
To solve at least one of the above problems, it is an object of the present invention to provide a method, an apparatus, and a storage medium for density prediction of an object in an image.
The technical solution adopted by the invention is as follows: an embodiment of the invention provides a method for predicting the density of a target in an image, comprising the following steps:
extracting shallow features of the image to obtain a first feature map;
processing the first feature map by using a frequency feature pyramid model to obtain a plurality of second feature maps with different scales;
performing convolution processing on the plurality of second feature maps with different scales respectively to obtain a plurality of third feature maps;
fusing the plurality of third feature maps to obtain a fourth feature map;
generating a weight matrix through a softmax function according to the fourth feature map;
and enhancing the weight matrix through an attention mechanism to generate an image target density prediction map.
Further, the method adopts the three-dimensional discrete cosine transform and the three-dimensional inverse discrete cosine transform to construct the frequency feature pyramid model.
Further, the step of processing the first feature map by using the frequency feature pyramid model specifically includes:
converting the first feature map from a spatial domain to a frequency domain by a three-dimensional discrete cosine transform;
extracting images of a plurality of different frequencies in a frequency domain;
and converting the images with different frequencies into a plurality of second feature maps with different scales through three-dimensional inverse discrete cosine transform.
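The frequency-domain split described above can be sketched with SciPy's DCT routines. The `frequency_bands` helper and the rule for how many low-frequency coefficients each band keeps are illustrative assumptions, not the patent's exact selection rule:

```python
import numpy as np
from scipy.fft import dctn, idctn

def frequency_bands(feat, fractions=(1/4, 1/16, 1/64, 1/256)):
    """Split a (rows, cols, channels) feature map into several frequency
    bands via 3D DCT / 3D IDCT without changing its spatial size.
    The per-band coefficient masks are an illustrative assumption."""
    M, N, L = feat.shape
    F = dctn(feat, type=2, norm='ortho')           # spatial -> frequency domain
    bands = []
    for frac in fractions:
        k = frac ** (1.0 / 3.0)                    # keep roughly `frac` of coefficients
        mask = np.zeros_like(F)
        m = max(1, int(M * k))
        n = max(1, int(N * k))
        l = max(1, int(L * k))
        mask[:m, :n, :l] = 1.0                     # keep the low-frequency corner
        bands.append(idctn(F * mask, type=2, norm='ortho'))  # frequency -> spatial
    return bands
```

Note that every band keeps the spatial size of the input feature map, which is the property the method relies on: no scaling is performed at any scale.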
Further, the three-dimensional discrete cosine transform and the three-dimensional inverse discrete cosine transform extend the transform performed along the column and row directions of the first feature map with a further transform along its channel dimension; the formulas of the three-dimensional discrete cosine transform and the three-dimensional inverse discrete cosine transform are as follows:
F(u, v, w) = c(u)c(v)c(w) Σx=0..N−1 Σy=0..M−1 Σz=0..L−1 f(x, y, z) cos[(2x+1)uπ/(2N)] cos[(2y+1)vπ/(2M)] cos[(2z+1)wπ/(2L)] (three-dimensional discrete cosine transform);
f(x, y, z) = Σu=0..N−1 Σv=0..M−1 Σw=0..L−1 c(u)c(v)c(w) F(u, v, w) cos[(2x+1)uπ/(2N)] cos[(2y+1)vπ/(2M)] cos[(2z+1)wπ/(2L)] (three-dimensional inverse discrete cosine transform);
wherein N represents the number of columns of the first feature map, M represents the number of rows, L represents the number of channels, f(x, y, z) is the feature point at the y-th row and x-th column of the z-th channel, F(u, v, w) is the corresponding frequency feature after the discrete cosine transform, and c(u), c(v), and c(w) are the corresponding compensation coefficients, with c(u) = √(1/N) for u = 0 and c(u) = √(2/N) otherwise (c(v) and c(w) are defined analogously with M and L).
Further, the attention mechanism enhancing the weight matrix is performed by the following formula:
Fi,c(x) = (1 + Hi,c(x)) × Gi,c(x),
wherein Gi,c(x) is the input of the frequency feature pyramid model, Hi,c(x) is the weight matrix generated by the softmax function, whose values lie in the range [0, 1], Fi,c(x) is the feature after multi-scale information enhancement, i is the i-th feature map channel, and c denotes the point at position c on the feature map.
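As a sketch, the enhancement step can be written as follows, reading the formula as the residual re-weighting F = (1 + H) × G; the softmax axis and the `attention_enhance` helper are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_enhance(G, fused):
    """F = (1 + H) * G: the weight matrix H (values in [0, 1]) generated
    from the fused feature map re-weights the pyramid input G."""
    H = softmax(fused, axis=0)
    return (1.0 + H) * G
```

Because H lies in [0, 1], the enhancement never suppresses a nonnegative input feature below its original value; it only amplifies it where the weight is large.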
Further, the method further comprises training the frequency feature pyramid model, including:
constructing a training set, wherein the training set is composed of different feature maps;
inputting the training set into the frequency feature pyramid model, and predicting the image target density;
calculating a difference value between the predicted value and the true value by using a loss function;
the loss function is minimized.
Further, the loss function is:
where Y is the actual density map, X is the input image, θ is the parameter of the frequency feature pyramid model, F(X, θ) represents the frequency feature pyramid model, GMS(i) refers to the gradient magnitude similarity of the prediction map and the actual map at point i, and N is the total number of pixels in the input image.
Further, the gradient magnitude similarity is computed by the following formula:
GMS(i) = (2·mYp(i)·mY(i) + c) / (mYp(i)² + mY(i)² + c),
wherein c is a positive constant, Yp is the predicted density map, Y is the actual density map, mYp(i) = √[(Yp ⊛ hx)²(i) + (Yp ⊛ hy)²(i)] is the gradient magnitude of the predicted density map at point i, mY(i) = √[(Y ⊛ hx)²(i) + (Y ⊛ hy)²(i)] is the gradient magnitude of the actual density map at point i, GMS(i) is the similarity of the gradient magnitudes of the predicted and actual density maps at point i, ⊛ refers to the convolution operation, hx refers to the Prewitt operator in the horizontal direction, and hy refers to the Prewitt operator in the vertical direction.
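The gradient magnitude similarity can be sketched with a standard Prewitt-based gradient; the 1/3 scaling of the operators and the boundary mode are assumptions:

```python
import numpy as np
from scipy.ndimage import convolve

# Prewitt operators, scaled by 1/3 (a common convention; an assumption here)
H_X = np.array([[1, 0, -1],
                [1, 0, -1],
                [1, 0, -1]]) / 3.0   # horizontal direction
H_Y = H_X.T                          # vertical direction

def gms(pred, actual, c=1e-4):
    """Per-pixel gradient magnitude similarity between two density maps."""
    def grad_mag(img):
        gx = convolve(img, H_X, mode='nearest')
        gy = convolve(img, H_Y, mode='nearest')
        return np.sqrt(gx ** 2 + gy ** 2)
    m_p, m_a = grad_mag(pred), grad_mag(actual)
    return (2 * m_p * m_a + c) / (m_p ** 2 + m_a ** 2 + c)
```

By construction gms(Y, Y) is identically 1, and values below 1 flag points where the local gradient structure of the prediction disagrees with the ground truth.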
In another aspect, embodiments of the present invention further include an apparatus for density prediction of objects in an image, comprising a processor and a memory, wherein,
the memory is used to store program instructions;
the processor is used for reading the program instructions in the memory and executing the method for density prediction of the target in the image according to the program instructions in the memory.
In another aspect, the present invention further includes a computer-readable storage medium, on which a computer program is stored which, when executed by a processor, performs the above method for density prediction of an object in an image.
The invention has the following beneficial effects: the method takes into account the relationship among different feature-map channels and constructs the frequency feature pyramid using the three-dimensional discrete cosine transform (3D DCT) and the three-dimensional inverse discrete cosine transform (3D IDCT), so that multi-scale frequency information can be extracted; because the feature map does not need to be scaled during feature extraction, the resulting feature map is guaranteed not to lose excessive detail information. The frequency multi-scale features are further fused and enhanced through an attention mechanism, so that a high-quality density prediction map can finally be generated. Meanwhile, the designed loss function fully considers the consistency of the local error and the global error, and therefore achieves better robustness when outliers appear during prediction.
Detailed Description
Fig. 1 is a flow chart of the steps of a method for density prediction of an object in an image. As shown in fig. 1, the method comprises the following processing steps:
s1, extracting shallow features of an image to obtain a first feature map;
s2, processing the first feature map by using a frequency feature pyramid model to obtain a plurality of second feature maps with different scales;
s3, performing convolution processing on the plurality of second feature maps with different scales respectively to obtain a plurality of third feature maps;
s4, fusing the plurality of third feature maps to obtain a fourth feature map;
s5, generating a weight matrix through a softmax function according to the fourth feature map;
and S6, enhancing the weight matrix through an attention mechanism to generate an image target density prediction map.
In this embodiment, the frequency feature pyramid model described in step S2 is constructed using the three-dimensional discrete cosine transform (3D DCT) and the three-dimensional inverse discrete cosine transform (3D IDCT). These are generalized from the one-dimensional discrete cosine transform (1D DCT) and the one-dimensional inverse discrete cosine transform (1D IDCT): the one-dimensional transforms operate along the column direction of the feature map; the two-dimensional transforms operate along both the column and row directions; and the three-dimensional transforms add, on that basis, a transform along the channel dimension of the feature map. The one-dimensional discrete cosine transform and the one-dimensional inverse discrete cosine transform are as follows:
one-dimensional discrete cosine transform:
F(u) = c(u) Σx=0..N−1 f(x) cos[(2x+1)uπ/(2N)];
one-dimensional inverse discrete cosine transform:
f(x) = Σu=0..N−1 c(u) F(u) cos[(2x+1)uπ/(2N)];
wherein N is the total number of original signals, f(x) is the x-th original signal, F(u) is the frequency signal after the discrete cosine transform, u is the frequency coefficient, and c(u) is the compensation coefficient, with c(u) = √(1/N) for u = 0 and c(u) = √(2/N) otherwise.
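A direct transcription of the one-dimensional formula can be checked against SciPy's orthonormal DCT-II; the `dct1d` helper below is an illustration, not part of the patented method:

```python
import numpy as np
from scipy.fft import dct, idct

def dct1d(f):
    """Naive orthonormal DCT-II from the formula above:
    F(u) = c(u) * sum_x f(x) * cos((2x+1) * u * pi / (2N))."""
    N = len(f)
    x = np.arange(N)
    F = np.empty(N)
    for u in range(N):
        c_u = np.sqrt(1.0 / N) if u == 0 else np.sqrt(2.0 / N)
        F[u] = c_u * np.sum(f * np.cos((2 * x + 1) * u * np.pi / (2 * N)))
    return F
```

The result agrees with scipy.fft.dct(f, type=2, norm='ortho'), and applying the corresponding inverse transform recovers the original signal, which is the invertibility the pyramid model relies on.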
According to the formulas of one-dimensional discrete cosine transform and one-dimensional inverse discrete cosine transform, the formulas of three-dimensional discrete cosine transform (3D DCT) and three-dimensional inverse discrete cosine transform (3D IDCT) can be generalized as follows:
F(u, v, w) = c(u)c(v)c(w) Σx=0..N−1 Σy=0..M−1 Σz=0..L−1 f(x, y, z) cos[(2x+1)uπ/(2N)] cos[(2y+1)vπ/(2M)] cos[(2z+1)wπ/(2L)] (three-dimensional discrete cosine transform);
f(x, y, z) = Σu=0..N−1 Σv=0..M−1 Σw=0..L−1 c(u)c(v)c(w) F(u, v, w) cos[(2x+1)uπ/(2N)] cos[(2y+1)vπ/(2M)] cos[(2z+1)wπ/(2L)] (three-dimensional inverse discrete cosine transform);
wherein N represents the number of columns of the first feature map, M represents the number of rows, L represents the number of channels, f(x, y, z) is the feature point at the y-th row and x-th column of the z-th channel, F(u, v, w) is the corresponding frequency feature after the discrete cosine transform, and c(u), c(v), and c(w) are the corresponding compensation coefficients, defined analogously to the one-dimensional case with N, M, and L respectively.
In the implementation, a front-end network is used to extract shallow features of the image to obtain a first feature map, which is then input to the frequency feature pyramid model for processing. Referring to fig. 2: T denotes the three-dimensional discrete cosine transform (3D DCT) and three-dimensional inverse discrete cosine transform (3D IDCT) operations, C denotes a convolution operation, W denotes the softmax operation that generates the weight matrix of the attention mechanism, and Concat denotes a concatenation operation along the channel dimension. In this embodiment, after receiving the first feature map, the frequency feature pyramid model converts it from the spatial domain to the frequency domain through the three-dimensional discrete cosine transform and extracts a plurality of images of different frequencies in the frequency domain. Specifically, 4 images with frequencies of 1/4, 1/16, 1/64, and 1/256 are taken and converted into 4 second feature maps of different scales through the three-dimensional inverse discrete cosine transform, corresponding to the four parallel rows of feature maps in fig. 2. Multi-scale features are further extracted through convolution; the feature maps of the 4 scales are then fused together through the concatenation operation, and a softmax function is used to generate a weight matrix. Finally, the weight matrix is enhanced through the attention mechanism: specifically, pixel-by-pixel multiplication and addition operations are performed with the first feature map to extract high-level semantic features, and a back-end network is used to generate a high-quality image target density prediction map.
Unlike a conventional multi-scale feature pyramid, this embodiment uses different frequencies of the image to represent different scales and generates the frequency multi-scale feature pyramid model using the three-dimensional discrete cosine transform (3D DCT) and three-dimensional inverse discrete cosine transform (3D IDCT). When the frequency feature pyramid is constructed, the feature map does not need to be scaled, so the resulting feature map is guaranteed not to lose excessive detail information. The discrete cosine transform converts an image from the spatial domain to the frequency domain, and the inverse discrete cosine transform converts it back; different frequencies in the frequency domain correspond to spatial-domain images without any scaling, so applying the inverse discrete cosine transform to the different frequencies yields multi-scale images of the same size as the original image. In this embodiment, the frequency feature pyramid model converts the input first feature map from the spatial domain to the frequency domain and then converts 4 feature maps with frequencies of 1/4, 1/16, 1/64, and 1/256 back to the spatial domain to obtain 4 second feature maps of different scales; the 4 second feature maps are convolved respectively to obtain 4 third feature maps of different scales; the 4 third feature maps are fused to obtain a fourth feature map; a weight matrix is generated from the fourth feature map through a softmax function; and the weight matrix is enhanced through the attention mechanism to generate the image target density prediction map.
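The fusion and weight-generation steps can be sketched as follows; the choice of the scale axis for the softmax, and the `fuse_and_weight` helper itself, are illustrative assumptions:

```python
import numpy as np

def fuse_and_weight(third_maps):
    """Concatenate the per-scale (third) feature maps along the channel
    dimension to form the fourth feature map, and generate a softmax
    weight matrix over the scale dimension."""
    stacked = np.stack(third_maps, axis=0)        # (scales, H, W, C)
    e = np.exp(stacked - stacked.max(axis=0, keepdims=True))
    weights = e / e.sum(axis=0, keepdims=True)    # softmax over scales, in [0, 1]
    fused = np.concatenate(third_maps, axis=-1)   # Concat along channels
    return fused, weights
```

The softmax guarantees that, at every position, the weights of the 4 scales sum to 1, so the attention step re-distributes rather than inflates the multi-scale response.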
In this embodiment, the attention mechanism may be defined as:
Fi,c(x) = (1 + Hi,c(x)) × Gi,c(x),
wherein Gi,c(x) is the input of the frequency feature pyramid model, Hi,c(x) is the weight matrix generated by the softmax function, whose values lie in the range [0, 1], Fi,c(x) is the feature after multi-scale information enhancement, i is the i-th feature map channel, and c denotes the point at position c on the feature map.
This embodiment also designs a new loss function, mainly because most existing methods adopt MSE as the loss function, and MSE is only a pixel-level loss function that measures the global error. This embodiment therefore designs a loss function that stays consistent both globally and locally, called the global-local consistency loss function, whose specific form is as follows:
where Y is the actual density map, X is the input image, θ is the parameter of the frequency feature pyramid model, F (X, θ) represents the frequency feature pyramid model, gms (i) refers to the gradient amplitude similarity of the prediction map and the actual map at point i, and N is the total number of pixels in the input image.
The Log-Cosh error is a loss function similar to MSE, but it is more robust when outliers occur and performs better when the target density varies greatly across samples from different scenes and different targets. However, it only reflects the global error and does not consider the local error, so the local error is constrained using the gradient magnitude similarity GMS(i), whose form is as follows:
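A plausible form of the global-local consistency loss combines a Log-Cosh global term with a (1 − GMS) local term. Since the source does not reproduce the loss formula, the exact combination and the weight `alpha` below are assumptions, as is the `global_local_loss` helper:

```python
import numpy as np
from scipy.ndimage import convolve

def global_local_loss(pred, actual, alpha=1.0, c=1e-4):
    """Sketch of a global-local consistency loss: a Log-Cosh term for the
    global error plus a (1 - GMS) term for the local error. The weight
    `alpha` and the combination itself are assumptions."""
    h_x = np.array([[1, 0, -1],
                    [1, 0, -1],
                    [1, 0, -1]]) / 3.0             # horizontal Prewitt operator
    h_y = h_x.T                                    # vertical Prewitt operator
    def grad_mag(img):
        return np.sqrt(convolve(img, h_x, mode='nearest') ** 2 +
                       convolve(img, h_y, mode='nearest') ** 2)
    m_p, m_a = grad_mag(pred), grad_mag(actual)
    gms_map = (2 * m_p * m_a + c) / (m_p ** 2 + m_a ** 2 + c)
    global_term = np.mean(np.log(np.cosh(pred - actual)))  # Log-Cosh error
    local_term = np.mean(1.0 - gms_map)                    # local structural error
    return global_term + alpha * local_term
```

Both terms vanish when prediction and ground truth coincide, and the Log-Cosh term grows only linearly for large residuals, which is the robustness-to-outliers property the text describes.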
GMS(i) = (2·mYp(i)·mY(i) + c) / (mYp(i)² + mY(i)² + c),
wherein c is a positive constant, Yp is the predicted density map, Y is the actual density map, mYp(i) = √[(Yp ⊛ hx)²(i) + (Yp ⊛ hy)²(i)] is the gradient magnitude of the predicted density map at point i, mY(i) = √[(Y ⊛ hx)²(i) + (Y ⊛ hy)²(i)] is the gradient magnitude of the actual density map at point i, GMS(i) is the similarity of the gradient magnitudes of the predicted and actual density maps at point i, ⊛ refers to the convolution operation, hx refers to the Prewitt operator in the horizontal direction, and hy refers to the Prewitt operator in the vertical direction. In this embodiment, the Prewitt operators hx and hy are defined as:
hx = (1/3)·[1 0 −1; 1 0 −1; 1 0 −1], hy = (1/3)·[1 1 1; 0 0 0; −1 −1 −1].
the global-local consistency loss function fully considers the consistency of local errors and global errors, and has better robustness when outliers appear in samples.
In this embodiment, referring to fig. 3, the specific settings of the network parameters are described in terms of inputs and outputs. As shown in fig. 3, the network consists of three parts: part a is the front-end network, part b is the back-end network, and part c is the frequency feature pyramid model. In this embodiment, the input of part a is an original RGB crowd image; the output of part a, the input and output of part b, and the input of part c are all intermediate feature maps; and the output of part c is the crowd distribution density map finally output by the network. In the front-end network of part a, Conv3-64-1 refers to 64 3x3 convolution kernels with a dilation (hole) factor of 1, Conv3-128-1 refers to 128 3x3 convolution kernels with a dilation factor of 1, and Conv3-256-1 refers to 256 3x3 convolution kernels with a dilation factor of 1; Max Pooling denotes the maximum pooling operation. Similarly, in the back-end network of part b, Conv3-512-2 refers to 512 3x3 convolution kernels with a dilation factor of 2, Conv3-256-2 to 256 such kernels, Conv3-128-2 to 128 such kernels, and Conv3-64-2 to 64 such kernels, while Conv1-1-1 refers to one 1x1 convolution kernel with a dilation factor of 1. In the frequency feature pyramid model of part c, DCT3D-1 refers to the three-dimensional discrete cosine transform and three-dimensional inverse discrete cosine transform with frequency coefficient 1, DCT3D-16 to those with frequency coefficient 16, DCT3D-32 to those with frequency coefficient 32, and DCT3D-64 to those with frequency coefficient 64. Softmax compresses the features of the different scales to the range 0-1; the result is then multiplied with the input of part c and added to it, which enhances the features at the different scales. This is the attention mechanism described above.
During training, the network is initialized with the parameters of the first 10 layers of VGG16, and the rest of the network is randomly initialized from a Gaussian distribution with mean 0 and standard deviation 0.01. The learning rate is set to 5e-6, the optimization algorithm is Adam, the momentum starts at 0.95 and decays at a rate of 5e-4, and the loss function is the global-local consistency loss function.
In summary, the method for predicting the density of the target in the image according to the embodiment of the present invention has the following advantages:
according to the embodiment of the invention, the relation between different feature diagram channels is considered, the frequency feature pyramid is constructed by adopting three-dimensional discrete cosine transform (3DDCT) and three-dimensional inverse discrete cosine transform (3D IDCT), multi-scale frequency information can be extracted, and the feature diagram does not need to be scaled in the feature extraction process, so that the obtained feature diagram can be ensured not to lose excessive detail information; frequency multi-scale features are further fused and enhanced through an attention mechanism, so that a high-quality density prediction graph can be finally generated; meanwhile, the designed loss function fully considers the consistency of the local error and the global error, so that better robustness can be achieved when outliers appear in the prediction process.
The present embodiments also include an apparatus for density prediction of objects in an image, which may include a processor and a memory, wherein:
the memory is used for storing program instructions;
the processor is used for reading the program instructions in the memory and executing the method for density prediction of the target in the image according to the embodiment.
The memory may also be provided separately and used to store a computer program corresponding to the method of density prediction of objects in an image. When the memory is connected to the processor, the stored computer program is read out and executed by the processor, so as to implement the method for predicting the density of the target in the image and achieve the technical effects described in the embodiments.
The present embodiment also includes a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the method for density prediction of an object in an image as shown in the embodiment.
It should be noted that, unless otherwise specified, when a feature is referred to as being "fixed" or "connected" to another feature, it may be directly fixed or connected to the other feature or indirectly fixed or connected to the other feature. Furthermore, the descriptions of upper, lower, left, right, etc. used in the present disclosure are only relative to the mutual positional relationship of the constituent parts of the present disclosure in the drawings. As used in this disclosure, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. In addition, unless defined otherwise, all technical and scientific terms used in this example have the same meaning as commonly understood by one of ordinary skill in the art. The terminology used in the description of the embodiments herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this embodiment, the term "and/or" includes any combination of one or more of the associated listed items.
It will be understood that, although the terms first, second, third, fourth, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element of the same type from another. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of the present disclosure. The use of any and all examples, or exemplary language ("e.g.," such as "or the like") provided with this embodiment is intended merely to better illuminate embodiments of the invention and does not pose a limitation on the scope of the invention unless otherwise claimed.
It should be recognized that embodiments of the present invention can be realized and implemented by computer hardware, a combination of hardware and software, or by computer instructions stored in a non-transitory computer readable memory. The methods may be implemented in a computer program using standard programming techniques, including a non-transitory computer-readable storage medium configured with the computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner, according to the methods and figures described in the detailed description. Each program may be implemented in a high level procedural or object oriented programming language to communicate with a computer system. However, the program(s) can be implemented in assembly or machine language, if desired. In any case, the language may be a compiled or interpreted language. Furthermore, the program can be run on a programmed application specific integrated circuit for this purpose.
Further, operations of processes described in this embodiment can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The processes described in this embodiment (or variations and/or combinations thereof) may be performed under the control of one or more computer systems configured with executable instructions, and may be implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications) collectively executed on one or more processors, by hardware, or combinations thereof. The computer program includes a plurality of instructions executable by one or more processors.
Further, the method may be implemented in any type of computing platform operatively connected to a suitable interface, including but not limited to a personal computer, mini computer, mainframe, workstation, networked or distributed computing environment, separate or integrated computer platform, or in communication with a charged particle tool or other imaging device, and the like. Aspects of the invention may be embodied in machine-readable code stored on a non-transitory storage medium or device, whether removable or integrated into a computing platform, such as a hard disk, optically read and/or write storage medium, RAM, ROM, or the like, such that it may be read by a programmable computer, which when read by the storage medium or device, is operative to configure and operate the computer to perform the procedures described herein. Further, the machine-readable code, or portions thereof, may be transmitted over a wired or wireless network. The invention described in this embodiment includes these and other different types of non-transitory computer-readable storage media when such media include instructions or programs that implement the steps described above in conjunction with a microprocessor or other data processor. The invention also includes the computer itself when programmed according to the methods and techniques described herein.
A computer program can be applied to input data to perform the functions described in the present embodiment to convert the input data to generate output data that is stored to a non-volatile memory. The output information may also be applied to one or more output devices, such as a display. In a preferred embodiment of the invention, the transformed data represents physical and tangible objects, including particular visual depictions of physical and tangible objects produced on a display.
The above description is only a preferred embodiment of the present invention, and the present invention is not limited to the above embodiment, and any modifications, equivalent substitutions, improvements, etc. within the spirit and principle of the present invention should be included in the protection scope of the present invention as long as the technical effects of the present invention are achieved by the same means. The invention is capable of other modifications and variations in its technical solution and/or its implementation, within the scope of protection of the invention.