Disclosure of Invention
To solve at least one of the above problems, it is an object of the present invention to provide a method, an apparatus, and a storage medium for density prediction of an object in an image.
The technical solution adopted by the invention is as follows: an embodiment of the invention provides a method for predicting the density of a target in an image, comprising the following steps:
extracting shallow features of the image to obtain a first feature map;
processing the first feature map by using a frequency feature pyramid model to obtain a plurality of second feature maps with different scales;
performing convolution processing on the plurality of second feature maps with different scales respectively to obtain a plurality of third feature maps;
fusing the plurality of third feature maps to obtain a fourth feature map;
generating a weight matrix through a softmax function according to the fourth feature map;
and enhancing the weight matrix through an attention mechanism to generate an image target density prediction map.
Further, the method adopts the three-dimensional discrete cosine transform and the three-dimensional inverse discrete cosine transform to construct the frequency feature pyramid model.
Further, the step of processing the first feature map by using the frequency feature pyramid model specifically includes:
converting the first feature map from a spatial domain to a frequency domain by a three-dimensional discrete cosine transform;
extracting images of a plurality of different frequencies in a frequency domain;
and converting the images with different frequencies into a plurality of second feature maps with different scales through three-dimensional inverse discrete cosine transform.
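The frequency-domain split described above can be sketched with SciPy's DCT routines. The `frequency_bands` helper and the rule for how many low-frequency coefficients each band keeps are illustrative assumptions, not the patent's exact selection rule:

```python
import numpy as np
from scipy.fft import dctn, idctn

def frequency_bands(feat, fractions=(1/4, 1/16, 1/64, 1/256)):
    """Split a (rows, cols, channels) feature map into several frequency
    bands via 3D DCT / 3D IDCT without changing its spatial size.
    The per-band coefficient masks are an illustrative assumption."""
    M, N, L = feat.shape
    F = dctn(feat, type=2, norm='ortho')           # spatial -> frequency domain
    bands = []
    for frac in fractions:
        k = frac ** (1.0 / 3.0)                    # keep roughly `frac` of coefficients
        mask = np.zeros_like(F)
        m = max(1, int(M * k))
        n = max(1, int(N * k))
        l = max(1, int(L * k))
        mask[:m, :n, :l] = 1.0                     # keep the low-frequency corner
        bands.append(idctn(F * mask, type=2, norm='ortho'))  # frequency -> spatial
    return bands
```

Note that every band keeps the spatial size of the input feature map, which is the property the method relies on: no scaling is performed at any scale.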
Further, the three-dimensional discrete cosine transform and the three-dimensional inverse discrete cosine transform extend the transform performed along the column and row directions of the first feature map with a further transform along its channel dimension; the formulas of the three-dimensional discrete cosine transform and the three-dimensional inverse discrete cosine transform are as follows:
F(u, v, w) = c(u)c(v)c(w) Σx=0..N−1 Σy=0..M−1 Σz=0..L−1 f(x, y, z) cos[(2x+1)uπ/(2N)] cos[(2y+1)vπ/(2M)] cos[(2z+1)wπ/(2L)] (three-dimensional discrete cosine transform);
f(x, y, z) = Σu=0..N−1 Σv=0..M−1 Σw=0..L−1 c(u)c(v)c(w) F(u, v, w) cos[(2x+1)uπ/(2N)] cos[(2y+1)vπ/(2M)] cos[(2z+1)wπ/(2L)] (three-dimensional inverse discrete cosine transform);
wherein N represents the number of columns of the first feature map, M represents the number of rows, L represents the number of channels, f(x, y, z) is the feature point at the y-th row and x-th column of the z-th channel, F(u, v, w) is the corresponding frequency feature after the discrete cosine transform, and c(u), c(v), and c(w) are the corresponding compensation coefficients, with c(u) = √(1/N) for u = 0 and c(u) = √(2/N) otherwise (c(v) and c(w) are defined analogously with M and L).
Further, the attention mechanism enhancing the weight matrix is performed by the following formula:
Fi,c(x) = (1 + Hi,c(x)) × Gi,c(x),
wherein Gi,c(x) is the input of the frequency feature pyramid model, Hi,c(x) is the weight matrix generated by the softmax function, whose values lie in the range [0, 1], Fi,c(x) is the feature after multi-scale information enhancement, i is the i-th feature map channel, and c denotes the point at position c on the feature map.
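As a sketch, the enhancement step can be written as follows, reading the formula as the residual re-weighting F = (1 + H) × G; the softmax axis and the `attention_enhance` helper are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_enhance(G, fused):
    """F = (1 + H) * G: the weight matrix H (values in [0, 1]) generated
    from the fused feature map re-weights the pyramid input G."""
    H = softmax(fused, axis=0)
    return (1.0 + H) * G
```

Because H lies in [0, 1], the enhancement never suppresses a nonnegative input feature below its original value; it only amplifies it where the weight is large.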
Further, the method further comprises training the frequency feature pyramid model, including:
constructing a training set, wherein the training set is composed of different feature maps;
inputting the training set into the frequency feature pyramid model, and predicting the image target density;
calculating a difference value between the predicted value and the true value by using a loss function;
the loss function is minimized.
Further, the loss function is:
where Y is the actual density map, X is the input image, θ is the parameter of the frequency feature pyramid model, F(X, θ) represents the frequency feature pyramid model, GMS(i) refers to the gradient magnitude similarity of the prediction map and the actual map at point i, and N is the total number of pixels in the input image.
Further, the gradient magnitude similarity is computed by the following formula:
GMS(i) = (2·mYp(i)·mY(i) + c) / (mYp(i)² + mY(i)² + c),
wherein c is a positive constant, Yp is the predicted density map, Y is the actual density map, mYp(i) = √[(Yp ⊛ hx)²(i) + (Yp ⊛ hy)²(i)] is the gradient magnitude of the predicted density map at point i, mY(i) = √[(Y ⊛ hx)²(i) + (Y ⊛ hy)²(i)] is the gradient magnitude of the actual density map at point i, GMS(i) is the similarity of the gradient magnitudes of the predicted and actual density maps at point i, ⊛ refers to the convolution operation, hx refers to the Prewitt operator in the horizontal direction, and hy refers to the Prewitt operator in the vertical direction.
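The gradient magnitude similarity can be sketched with a standard Prewitt-based gradient; the 1/3 scaling of the operators and the boundary mode are assumptions:

```python
import numpy as np
from scipy.ndimage import convolve

# Prewitt operators, scaled by 1/3 (a common convention; an assumption here)
H_X = np.array([[1, 0, -1],
                [1, 0, -1],
                [1, 0, -1]]) / 3.0   # horizontal direction
H_Y = H_X.T                          # vertical direction

def gms(pred, actual, c=1e-4):
    """Per-pixel gradient magnitude similarity between two density maps."""
    def grad_mag(img):
        gx = convolve(img, H_X, mode='nearest')
        gy = convolve(img, H_Y, mode='nearest')
        return np.sqrt(gx ** 2 + gy ** 2)
    m_p, m_a = grad_mag(pred), grad_mag(actual)
    return (2 * m_p * m_a + c) / (m_p ** 2 + m_a ** 2 + c)
```

By construction gms(Y, Y) is identically 1, and values below 1 flag points where the local gradient structure of the prediction disagrees with the ground truth.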
In another aspect, embodiments of the present invention further include an apparatus for density prediction of objects in an image, comprising a processor and a memory, wherein,
the memory is used to store program instructions;
the processor is used for reading the program instructions in the memory and executing the method for density prediction of the target in the image according to the program instructions in the memory.
In another aspect, the present invention further includes a computer-readable storage medium, on which a computer program is stored which, when executed by a processor, performs the above method for density prediction of an object in an image.
The invention has the following beneficial effects: the method takes into account the relationship among different feature-map channels and constructs the frequency feature pyramid using the three-dimensional discrete cosine transform (3D DCT) and the three-dimensional inverse discrete cosine transform (3D IDCT), so that multi-scale frequency information can be extracted; because the feature map does not need to be scaled during feature extraction, the resulting feature map is guaranteed not to lose excessive detail information. The frequency multi-scale features are further fused and enhanced through an attention mechanism, so that a high-quality density prediction map can finally be generated. Meanwhile, the designed loss function fully considers the consistency of the local error and the global error, and therefore achieves better robustness when outliers appear during prediction.
Detailed Description
Fig. 1 is a flow chart of the steps of a method for density prediction of an object in an image. As shown in fig. 1, the method comprises the following processing steps:
s1, extracting shallow features of an image to obtain a first feature map;
s2, processing the first feature map by using a frequency feature pyramid model to obtain a plurality of second feature maps with different scales;
s3, performing convolution processing on the plurality of second feature maps with different scales respectively to obtain a plurality of third feature maps;
s4, fusing the plurality of third feature maps to obtain a fourth feature map;
s5, generating a weight matrix through a softmax function according to the fourth feature map;
and S6, enhancing the weight matrix through an attention mechanism to generate an image target density prediction map.
In this embodiment, the frequency feature pyramid model described in step S2 is constructed using the three-dimensional discrete cosine transform (3D DCT) and the three-dimensional inverse discrete cosine transform (3D IDCT). These are generalized from the one-dimensional discrete cosine transform (1D DCT) and the one-dimensional inverse discrete cosine transform (1D IDCT): the one-dimensional transforms operate along the column direction of the feature map; the two-dimensional transforms operate along both the column and row directions; and the three-dimensional transforms add, on that basis, a transform along the channel dimension of the feature map. The one-dimensional discrete cosine transform and the one-dimensional inverse discrete cosine transform are as follows:
one-dimensional discrete cosine transform:
F(u) = c(u) Σx=0..N−1 f(x) cos[(2x+1)uπ/(2N)];
one-dimensional inverse discrete cosine transform:
f(x) = Σu=0..N−1 c(u) F(u) cos[(2x+1)uπ/(2N)];
wherein N is the total number of original signals, f(x) is the x-th original signal, F(u) is the frequency signal after the discrete cosine transform, u is the frequency coefficient, and c(u) is the compensation coefficient, with c(u) = √(1/N) for u = 0 and c(u) = √(2/N) otherwise.
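A direct transcription of the one-dimensional formula can be checked against SciPy's orthonormal DCT-II; the `dct1d` helper below is an illustration, not part of the patented method:

```python
import numpy as np
from scipy.fft import dct, idct

def dct1d(f):
    """Naive orthonormal DCT-II from the formula above:
    F(u) = c(u) * sum_x f(x) * cos((2x+1) * u * pi / (2N))."""
    N = len(f)
    x = np.arange(N)
    F = np.empty(N)
    for u in range(N):
        c_u = np.sqrt(1.0 / N) if u == 0 else np.sqrt(2.0 / N)
        F[u] = c_u * np.sum(f * np.cos((2 * x + 1) * u * np.pi / (2 * N)))
    return F
```

The result agrees with scipy.fft.dct(f, type=2, norm='ortho'), and applying the corresponding inverse transform recovers the original signal, which is the invertibility the pyramid model relies on.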
According to the formulas of one-dimensional discrete cosine transform and one-dimensional inverse discrete cosine transform, the formulas of three-dimensional discrete cosine transform (3D DCT) and three-dimensional inverse discrete cosine transform (3D IDCT) can be generalized as follows:
F(u, v, w) = c(u)c(v)c(w) Σx=0..N−1 Σy=0..M−1 Σz=0..L−1 f(x, y, z) cos[(2x+1)uπ/(2N)] cos[(2y+1)vπ/(2M)] cos[(2z+1)wπ/(2L)] (three-dimensional discrete cosine transform);
f(x, y, z) = Σu=0..N−1 Σv=0..M−1 Σw=0..L−1 c(u)c(v)c(w) F(u, v, w) cos[(2x+1)uπ/(2N)] cos[(2y+1)vπ/(2M)] cos[(2z+1)wπ/(2L)] (three-dimensional inverse discrete cosine transform);
wherein N represents the number of columns of the first feature map, M represents the number of rows, L represents the number of channels, f(x, y, z) is the feature point at the y-th row and x-th column of the z-th channel, F(u, v, w) is the corresponding frequency feature after the discrete cosine transform, and c(u), c(v), and c(w) are the corresponding compensation coefficients, defined analogously to the one-dimensional case with N, M, and L respectively.
In the implementation, a front-end network is used to extract shallow features of the image to obtain a first feature map, which is then input to the frequency feature pyramid model for processing. Referring to fig. 2: T denotes the three-dimensional discrete cosine transform (3D DCT) and three-dimensional inverse discrete cosine transform (3D IDCT) operations, C denotes a convolution operation, W denotes the softmax operation that generates the weight matrix of the attention mechanism, and Concat denotes a concatenation operation along the channel dimension. In this embodiment, after receiving the first feature map, the frequency feature pyramid model converts it from the spatial domain to the frequency domain through the three-dimensional discrete cosine transform and extracts a plurality of images of different frequencies in the frequency domain. Specifically, 4 images with frequencies of 1/4, 1/16, 1/64, and 1/256 are taken and converted into 4 second feature maps of different scales through the three-dimensional inverse discrete cosine transform, corresponding to the four parallel rows of feature maps in fig. 2. Multi-scale features are further extracted through convolution; the feature maps of the 4 scales are then fused together through the concatenation operation, and a softmax function is used to generate a weight matrix. Finally, the weight matrix is enhanced through the attention mechanism: specifically, pixel-by-pixel multiplication and addition operations are performed with the first feature map to extract high-level semantic features, and a back-end network is used to generate a high-quality image target density prediction map.
Unlike a conventional multi-scale feature pyramid, this embodiment uses different frequencies of the image to represent different scales and generates the frequency multi-scale feature pyramid model using the three-dimensional discrete cosine transform (3D DCT) and three-dimensional inverse discrete cosine transform (3D IDCT). When the frequency feature pyramid is constructed, the feature map does not need to be scaled, so the resulting feature map is guaranteed not to lose excessive detail information. The discrete cosine transform converts an image from the spatial domain to the frequency domain, and the inverse discrete cosine transform converts it back; different frequencies in the frequency domain correspond to spatial-domain images without any scaling, so applying the inverse discrete cosine transform to the different frequencies yields multi-scale images of the same size as the original image. In this embodiment, the frequency feature pyramid model converts the input first feature map from the spatial domain to the frequency domain and then converts 4 feature maps with frequencies of 1/4, 1/16, 1/64, and 1/256 back to the spatial domain to obtain 4 second feature maps of different scales; the 4 second feature maps are convolved respectively to obtain 4 third feature maps of different scales; the 4 third feature maps are fused to obtain a fourth feature map; a weight matrix is generated from the fourth feature map through a softmax function; and the weight matrix is enhanced through the attention mechanism to generate the image target density prediction map.
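The fusion and weight-generation steps can be sketched as follows; the choice of the scale axis for the softmax, and the `fuse_and_weight` helper itself, are illustrative assumptions:

```python
import numpy as np

def fuse_and_weight(third_maps):
    """Concatenate the per-scale (third) feature maps along the channel
    dimension to form the fourth feature map, and generate a softmax
    weight matrix over the scale dimension."""
    stacked = np.stack(third_maps, axis=0)        # (scales, H, W, C)
    e = np.exp(stacked - stacked.max(axis=0, keepdims=True))
    weights = e / e.sum(axis=0, keepdims=True)    # softmax over scales, in [0, 1]
    fused = np.concatenate(third_maps, axis=-1)   # Concat along channels
    return fused, weights
```

The softmax guarantees that, at every position, the weights of the 4 scales sum to 1, so the attention step re-distributes rather than inflates the multi-scale response.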
In this embodiment, the attention mechanism may be defined as:
Fi,c(x) = (1 + Hi,c(x)) × Gi,c(x),
wherein Gi,c(x) is the input of the frequency feature pyramid model, Hi,c(x) is the weight matrix generated by the softmax function, whose values lie in the range [0, 1], Fi,c(x) is the feature after multi-scale information enhancement, i is the i-th feature map channel, and c denotes the point at position c on the feature map.
This embodiment also designs a new loss function, mainly because most existing methods adopt MSE as the loss function, and MSE is only a pixel-level loss function that measures the global error. This embodiment therefore designs a loss function that stays consistent both globally and locally, called the global-local consistency loss function, whose specific form is as follows:
where Y is the actual density map, X is the input image, θ is the parameter of the frequency feature pyramid model, F (X, θ) represents the frequency feature pyramid model, gms (i) refers to the gradient amplitude similarity of the prediction map and the actual map at point i, and N is the total number of pixels in the input image.
The Log-Cosh error is a loss function similar to MSE, but it is more robust when outliers occur and performs better when the target density varies greatly across samples from different scenes and different targets. However, it only reflects the global error and does not consider the local error, so the local error is constrained using the gradient magnitude similarity GMS(i), whose form is as follows:
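A plausible form of the global-local consistency loss combines a Log-Cosh global term with a (1 − GMS) local term. Since the source does not reproduce the loss formula, the exact combination and the weight `alpha` below are assumptions, as is the `global_local_loss` helper:

```python
import numpy as np
from scipy.ndimage import convolve

def global_local_loss(pred, actual, alpha=1.0, c=1e-4):
    """Sketch of a global-local consistency loss: a Log-Cosh term for the
    global error plus a (1 - GMS) term for the local error. The weight
    `alpha` and the combination itself are assumptions."""
    h_x = np.array([[1, 0, -1],
                    [1, 0, -1],
                    [1, 0, -1]]) / 3.0             # horizontal Prewitt operator
    h_y = h_x.T                                    # vertical Prewitt operator
    def grad_mag(img):
        return np.sqrt(convolve(img, h_x, mode='nearest') ** 2 +
                       convolve(img, h_y, mode='nearest') ** 2)
    m_p, m_a = grad_mag(pred), grad_mag(actual)
    gms_map = (2 * m_p * m_a + c) / (m_p ** 2 + m_a ** 2 + c)
    global_term = np.mean(np.log(np.cosh(pred - actual)))  # Log-Cosh error
    local_term = np.mean(1.0 - gms_map)                    # local structural error
    return global_term + alpha * local_term
```

Both terms vanish when prediction and ground truth coincide, and the Log-Cosh term grows only linearly for large residuals, which is the robustness-to-outliers property the text describes.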
GMS(i) = (2·mYp(i)·mY(i) + c) / (mYp(i)² + mY(i)² + c),
wherein c is a positive constant, Yp is the predicted density map, Y is the actual density map, mYp(i) = √[(Yp ⊛ hx)²(i) + (Yp ⊛ hy)²(i)] is the gradient magnitude of the predicted density map at point i, mY(i) = √[(Y ⊛ hx)²(i) + (Y ⊛ hy)²(i)] is the gradient magnitude of the actual density map at point i, GMS(i) is the similarity of the gradient magnitudes of the predicted and actual density maps at point i, ⊛ refers to the convolution operation, hx refers to the Prewitt operator in the horizontal direction, and hy refers to the Prewitt operator in the vertical direction. In this embodiment, the Prewitt operators hx and hy are defined as:
hx = (1/3)·[1 0 −1; 1 0 −1; 1 0 −1], hy = (1/3)·[1 1 1; 0 0 0; −1 −1 −1].
the global-local consistency loss function fully considers the consistency of local errors and global errors, and has better robustness when outliers appear in samples.
In this embodiment, referring to fig. 3, the specific settings of the network parameters are described in terms of inputs and outputs. As shown in fig. 3, the network consists of three parts: part a is the front-end network, part b is the back-end network, and part c is the frequency feature pyramid model. In this embodiment, the input of part a is an original RGB crowd image; the output of part a, the input and output of part b, and the input of part c are all intermediate feature maps; and the output of part c is the crowd distribution density map finally output by the network. In the front-end network of part a, Conv3-64-1 refers to 64 3x3 convolution kernels with a dilation (hole) factor of 1, Conv3-128-1 refers to 128 3x3 convolution kernels with a dilation factor of 1, and Conv3-256-1 refers to 256 3x3 convolution kernels with a dilation factor of 1; Max Pooling denotes the maximum pooling operation. Similarly, in the back-end network of part b, Conv3-512-2 refers to 512 3x3 convolution kernels with a dilation factor of 2, Conv3-256-2 to 256 such kernels, Conv3-128-2 to 128 such kernels, and Conv3-64-2 to 64 such kernels, while Conv1-1-1 refers to one 1x1 convolution kernel with a dilation factor of 1. In the frequency feature pyramid model of part c, DCT3D-1 refers to the three-dimensional discrete cosine transform and three-dimensional inverse discrete cosine transform with frequency coefficient 1, DCT3D-16 to those with frequency coefficient 16, DCT3D-32 to those with frequency coefficient 32, and DCT3D-64 to those with frequency coefficient 64. Softmax compresses the features of the different scales to the range 0-1; the result is then multiplied with the input of part c and added to it, which enhances the features at the different scales. This is the attention mechanism described above.
During training, the network is initialized with the parameters of the first 10 layers of VGG16, and the rest of the network is randomly initialized from a Gaussian distribution with mean 0 and standard deviation 0.01. The learning rate is set to 5e-6, the optimization algorithm is Adam, the momentum starts at 0.95 and decays at a rate of 5e-4, and the loss function is the global-local consistency loss function.
In summary, the method for predicting the density of the target in the image according to the embodiment of the present invention has the following advantages:
according to the embodiment of the invention, the relation between different feature diagram channels is considered, the frequency feature pyramid is constructed by adopting three-dimensional discrete cosine transform (3DDCT) and three-dimensional inverse discrete cosine transform (3D IDCT), multi-scale frequency information can be extracted, and the feature diagram does not need to be scaled in the feature extraction process, so that the obtained feature diagram can be ensured not to lose excessive detail information; frequency multi-scale features are further fused and enhanced through an attention mechanism, so that a high-quality density prediction graph can be finally generated; meanwhile, the designed loss function fully considers the consistency of the local error and the global error, so that better robustness can be achieved when outliers appear in the prediction process.
The present embodiments also include an apparatus for density prediction of objects in an image, which may include a processor and a memory, wherein:
the memory is used for storing program instructions;
the processor is used for reading the program instructions in the memory and executing the method for density prediction of the target in the image according to the embodiment.
The memory may also be provided separately and used to store a computer program corresponding to the method of density prediction of objects in an image. When the memory is connected to the processor, the stored computer program is read out and executed by the processor, so as to implement the method for predicting the density of the target in the image and achieve the technical effects described in the embodiments.
The present embodiment also includes a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the method for density prediction of an object in an image as shown in the embodiment.
It should be noted that, unless otherwise specified, when a feature is referred to as being "fixed" or "connected" to another feature, it may be directly fixed or connected to the other feature or indirectly fixed or connected to the other feature. Furthermore, the descriptions of upper, lower, left, right, etc. used in the present disclosure are only relative to the mutual positional relationship of the constituent parts of the present disclosure in the drawings. As used in this disclosure, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. In addition, unless defined otherwise, all technical and scientific terms used in this example have the same meaning as commonly understood by one of ordinary skill in the art. The terminology used in the description of the embodiments herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this embodiment, the term "and/or" includes any combination of one or more of the associated listed items.
It will be understood that, although the terms first, second, third, fourth, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element of the same type from another. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of the present disclosure. The use of any and all examples, or exemplary language ("e.g.," such as "or the like") provided with this embodiment is intended merely to better illuminate embodiments of the invention and does not pose a limitation on the scope of the invention unless otherwise claimed.
It should be recognized that embodiments of the present invention can be realized and implemented by computer hardware, a combination of hardware and software, or by computer instructions stored in a non-transitory computer readable memory. The methods may be implemented in a computer program using standard programming techniques, including a non-transitory computer-readable storage medium configured with the computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner, according to the methods and figures described in the detailed description. Each program may be implemented in a high level procedural or object oriented programming language to communicate with a computer system. However, the program(s) can be implemented in assembly or machine language, if desired. In any case, the language may be a compiled or interpreted language. Furthermore, the program can be run on a programmed application specific integrated circuit for this purpose.
Further, operations of processes described in this embodiment can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The processes described in this embodiment (or variations and/or combinations thereof) may be performed under the control of one or more computer systems configured with executable instructions, and may be implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications) collectively executed on one or more processors, by hardware, or combinations thereof. The computer program includes a plurality of instructions executable by one or more processors.
Further, the method may be implemented in any type of computing platform operatively connected to a suitable interface, including but not limited to a personal computer, mini computer, mainframe, workstation, networked or distributed computing environment, separate or integrated computer platform, or in communication with a charged particle tool or other imaging device, and the like. Aspects of the invention may be embodied in machine-readable code stored on a non-transitory storage medium or device, whether removable or integrated into a computing platform, such as a hard disk, optically read and/or write storage medium, RAM, ROM, or the like, such that it may be read by a programmable computer, which when read by the storage medium or device, is operative to configure and operate the computer to perform the procedures described herein. Further, the machine-readable code, or portions thereof, may be transmitted over a wired or wireless network. The invention described in this embodiment includes these and other different types of non-transitory computer-readable storage media when such media include instructions or programs that implement the steps described above in conjunction with a microprocessor or other data processor. The invention also includes the computer itself when programmed according to the methods and techniques described herein.
A computer program can be applied to input data to perform the functions described in the present embodiment to convert the input data to generate output data that is stored to a non-volatile memory. The output information may also be applied to one or more output devices, such as a display. In a preferred embodiment of the invention, the transformed data represents physical and tangible objects, including particular visual depictions of physical and tangible objects produced on a display.
The above description is only a preferred embodiment of the present invention, and the present invention is not limited to the above embodiment, and any modifications, equivalent substitutions, improvements, etc. within the spirit and principle of the present invention should be included in the protection scope of the present invention as long as the technical effects of the present invention are achieved by the same means. The invention is capable of other modifications and variations in its technical solution and/or its implementation, within the scope of protection of the invention.