CN111310805A - Method, device and medium for predicting density of target in image - Google Patents

Method, device and medium for predicting density of target in image

Info

Publication number
CN111310805A
Authority
CN
China
Prior art keywords
feature
discrete cosine
cosine transform
map
frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010074908.2A
Other languages
Chinese (zh)
Other versions
CN111310805B (en)
Inventor
Liang Yanyan
Yu Xiaoyuan
Lin Xuxin
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Boyan Technology Zhuhai Co ltd
Original Assignee
China Energy International Construction Investment Group Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Energy International Construction Investment Group Co ltd filed Critical China Energy International Construction Investment Group Co ltd
Priority to CN202010074908.2A priority Critical patent/CN111310805B/en
Publication of CN111310805A publication Critical patent/CN111310805A/en
Application granted granted Critical
Publication of CN111310805B publication Critical patent/CN111310805B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/42Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G06V10/431Frequency domain transformation; Autocorrelation
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method, a device and a storage medium for predicting the density of objects in an image. The method considers the relationships between different feature map channels and uses the three-dimensional discrete cosine transform (3D DCT) and three-dimensional inverse discrete cosine transform (3D IDCT) to construct a frequency feature pyramid that can extract multi-scale frequency information; because the feature map need not be scaled during feature extraction, the resulting feature maps are guaranteed not to lose excessive detail. The multi-scale frequency features are further fused and enhanced through an attention mechanism, so that a high-quality density prediction map can finally be generated. Meanwhile, the designed loss function fully accounts for the consistency of local and global errors, giving better robustness when outliers appear during prediction. The invention can be widely applied in the technical field of image processing.

Description

Method, device and medium for predicting density of target in image
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a method, an apparatus, and a storage medium for performing density prediction on an object in an image.
Background
Density estimation for dense crowds and traffic flow is one of the key technologies in most smart-city applications: it can count the number of people or passing vehicles and can display the distribution of crowds and traffic, so it plays an important role in many practical applications such as crowd management, safety protection, city planning, consumer behavior analysis and municipal traffic planning. For example, the large crowds that gather at famous tourist attractions or at holiday events can cause trampling accidents that bring great harm to people; crowd counts in a waiting hall can be used to optimize the dispatching of public transportation; a change in the population of an area may cause an accident or be the result of one; counts of people in a shopping mall at different times of day can be used to analyze consumer behavior patterns; and road sections that suffer periodic congestion can be identified in order to plan municipal traffic and optimize scheduling.
At present, methods for estimating the density of targets in an image fall mainly into traditional methods based on fitting a regression to hand-crafted low-level features and methods based on deep learning. The traditional methods first require manually designing and extracting various features, then training a linear or nonlinear function on those features to regress a density map, and generally comprise three main steps: foreground segmentation, feature extraction and density map estimation. Foreground segmentation aims to separate the crowd or traffic from the image to facilitate subsequent feature extraction; since segmentation performance directly affects the final counting accuracy, it is an important factor limiting the performance of traditional algorithms. Feature extraction extracts various low-level features from the segmented foreground, and density estimation regresses the extracted features onto the crowd or traffic distribution in the image. Features in the image must be abstracted and expressed before the regression model is built; feature expression generally involves the extraction, selection and transformation of low-level visual attributes, constructing intermediate inputs to a regression model that estimates the distribution of people or traffic in the image. However, the three steps are independent of each other and must be optimized separately; the connections between the steps are lost, and it is difficult to optimize the pipeline as a whole. Deep-learning-based methods, by contrast, can automatically learn efficient representations of target features from big data through a deep neural network, and well-learned features can greatly improve system performance. Although current deep-learning methods have made key breakthroughs in the density estimation task, problems remain that are difficult to solve completely: target features in an image are easily disturbed by many external uncertain factors. For example, the appearance of a target is easily affected by scale, pose, viewing angle and so on; targets in the same scene appear at many different scales; and severe occlusion and inconsistency between the density map and the actual distribution are common.
Disclosure of Invention
To solve at least one of the above problems, it is an object of the present invention to provide a method, an apparatus, and a storage medium for density prediction of an object in an image.
The technical scheme adopted by the invention is as follows: the embodiment of the invention comprises a method for predicting the density of a target in an image, which comprises the following steps:
extracting shallow features of the image to obtain a first feature map;
processing the first feature map by using a frequency feature pyramid model to obtain a plurality of second feature maps with different scales;
performing convolution processing on the plurality of second feature maps with different scales respectively to obtain a plurality of third feature maps;
fusing the plurality of third feature maps to obtain a fourth feature map;
generating a weight matrix through a softmax function according to the fourth feature map;
and enhancing the weight matrix through an attention mechanism to generate an image target density prediction map.
Further, the method adopts three-dimensional discrete cosine transform and three-dimensional inverse discrete cosine transform to construct the frequency characteristic pyramid model.
Further, the step of processing the first feature map by using the frequency feature pyramid model specifically includes:
converting the first feature map from a spatial domain to a frequency domain by a three-dimensional discrete cosine transform;
extracting images of a plurality of different frequencies in a frequency domain;
and converting the images with different frequencies into a plurality of second feature maps with different scales through three-dimensional inverse discrete cosine transform.
Further, the three-dimensional discrete cosine transform and the three-dimensional inverse discrete cosine transform extend the transform performed along the column and row directions of the first feature map with a further transform along the channel dimension of the first feature map; their formulas are as follows:
$$F(u,v,w)=c(u)\,c(v)\,c(w)\sum_{x=0}^{N-1}\sum_{y=0}^{M-1}\sum_{z=0}^{L-1} f(x,y,z)\cos\frac{(2x+1)u\pi}{2N}\cos\frac{(2y+1)v\pi}{2M}\cos\frac{(2z+1)w\pi}{2L}$$

(three-dimensional discrete cosine transform);

$$f(x,y,z)=\sum_{u=0}^{N-1}\sum_{v=0}^{M-1}\sum_{w=0}^{L-1} c(u)\,c(v)\,c(w)\,F(u,v,w)\cos\frac{(2x+1)u\pi}{2N}\cos\frac{(2y+1)v\pi}{2M}\cos\frac{(2z+1)w\pi}{2L}$$

(three-dimensional inverse discrete cosine transform);

wherein,

$$c(u)=\begin{cases}\sqrt{1/N},&u=0\\\sqrt{2/N},&u\neq 0\end{cases}\qquad c(v)=\begin{cases}\sqrt{1/M},&v=0\\\sqrt{2/M},&v\neq 0\end{cases}\qquad c(w)=\begin{cases}\sqrt{1/L},&w=0\\\sqrt{2/L},&w\neq 0\end{cases}$$
where N is the number of columns of the first feature map, M is the number of rows, L is the number of channels, f(x, y, z) is the feature value at the y-th row and x-th column of the z-th channel, F(u, v, w) is the corresponding frequency feature after the discrete cosine transform, and c(u), c(v) and c(w) are the corresponding compensation coefficients.
Further, the attention mechanism enhances the weight matrix according to the following formula:

$$F_{i,c}(x)=(1+H_{i,c}(x))\times G_{i,c}(x)$$

where G(x) is the input of the frequency feature pyramid model, H(x) is the weight matrix generated by the softmax function, whose range is [0, 1], F(x) is the feature after multi-scale information enhancement, i indexes the i-th feature map channel, and c denotes the point at position c on the feature map.
Further, the method further comprises training the frequency feature pyramid model, including:
constructing a training set, wherein the training set is composed of different feature maps;
inputting the training set into the frequency feature pyramid model and predicting the image target density;
calculating the difference between the predicted value and the true value by using a loss function;
and minimizing the loss function.
Further, the loss function is:

$$L(\theta)=\frac{1}{N}\sum_{i=1}^{N}\Big[\log\cosh\big(F(X,\theta)_i-Y_i\big)+\big(1-GMS(i)\big)\Big]$$

where Y is the actual density map, X is the input image, θ is the parameter set of the frequency feature pyramid model, F(X, θ) denotes the output of the frequency feature pyramid model, GMS(i) is the gradient magnitude similarity between the prediction map and the actual map at point i, and N is the total number of pixels in the input image.
Further, the gradient magnitude similarity is computed by the following formula:

$$GMS(i)=\frac{2\,m_{Y_p}(i)\,m_Y(i)+c}{m_{Y_p}(i)^2+m_Y(i)^2+c}$$

wherein,

$$m_{Y_p}(i)=\sqrt{\big((Y_p\circledast h_x)(i)\big)^2+\big((Y_p\circledast h_y)(i)\big)^2},\qquad m_Y(i)=\sqrt{\big((Y\circledast h_x)(i)\big)^2+\big((Y\circledast h_y)(i)\big)^2}$$

where c is a positive constant, Y_p is the predicted density map, Y is the actual density map, m_{Y_p}(i) is the gradient magnitude of the predicted density map at point i, m_Y(i) is the gradient magnitude of the actual density map at point i, GMS(i) is the gradient magnitude similarity of the predicted and actual density maps at point i, ⊛ denotes the convolution operation, h_x is the Prewitt operator in the horizontal direction, and h_y is the Prewitt operator in the vertical direction.
In another aspect, embodiments of the present invention further include an apparatus for density prediction of objects in an image, comprising a processor and a memory, wherein,
the memory is to store program instructions;
the processor is used for reading the program instructions in the memory and executing the method for density prediction of the target in the image according to the program instructions in the memory.
In another aspect, the present invention further includes a computer-readable storage medium, on which a computer program is stored which, when executed by a processor, performs the method for density prediction of an object in an image described in the embodiments.
The invention has the beneficial effects that: the method considers the relationships between different feature map channels and constructs a frequency feature pyramid using the three-dimensional discrete cosine transform (3D DCT) and three-dimensional inverse discrete cosine transform (3D IDCT), so multi-scale frequency information can be extracted; because the feature map need not be scaled during feature extraction, the resulting feature maps are guaranteed not to lose excessive detail. The multi-scale frequency features are further fused and enhanced through an attention mechanism, so a high-quality density prediction map can finally be generated. Meanwhile, the designed loss function fully accounts for the consistency of local and global errors, giving better robustness when outliers appear during prediction.
Drawings
FIG. 1 is a flowchart illustrating steps of a method for density prediction of an object in an image according to an embodiment;
FIG. 2 is a schematic structural diagram of the pyramid model with frequency characteristics in an embodiment;
FIG. 3 is a schematic diagram of the specific network parameter settings in an embodiment.
Detailed Description
Fig. 1 is a flow chart of the steps of a method for density prediction of an object in an image, as shown in fig. 1, the method comprising the processing steps of:
s1, extracting shallow features of an image to obtain a first feature map;
s2, processing the first feature map by using a frequency feature pyramid model to obtain a plurality of second feature maps with different scales;
s3, performing convolution processing on the plurality of second feature maps with different scales respectively to obtain a plurality of third feature maps;
s4, fusing the plurality of third feature maps to obtain a fourth feature map;
s5, generating a weight matrix through a softmax function according to the fourth feature map;
and S6, enhancing the weight matrix through an attention mechanism to generate an image target density prediction map.
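As a rough sketch of how steps S1 to S6 might compose in code (PyTorch is assumed; front_end, freq_pyramid, branch_convs, weight_conv and back_end are hypothetical modules standing in for the networks described below, not the patent's exact layers):

```python
import torch

def predict_density(image, front_end, freq_pyramid, branch_convs, weight_conv, back_end):
    g = front_end(image)                            # S1: shallow features -> first feature map G(x)
    seconds = freq_pyramid(g)                       # S2: second feature maps at several scales
    thirds = [conv(s) for conv, s in zip(branch_convs, seconds)]  # S3: convolve each scale
    fourth = torch.cat(thirds, dim=1)               # S4: fuse by channel concatenation
    h = torch.softmax(weight_conv(fourth), dim=1)   # S5: weight matrix H(x), values in [0, 1]
    f = (1.0 + h) * g                               # S6: attention enhancement F = (1 + H) * G
    return back_end(f)                              # back-end generates the density prediction map
```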
In this embodiment, the frequency feature pyramid model described in step S2 is constructed using the three-dimensional discrete cosine transform (3D DCT) and the three-dimensional inverse discrete cosine transform (3D IDCT). These are generalizations of the one-dimensional discrete cosine transform (1D DCT) and the one-dimensional inverse discrete cosine transform (1D IDCT): the one-dimensional transforms operate along the column direction of the feature map, the two-dimensional transforms operate along both the column and row directions, and the three-dimensional transforms add, on top of the two-dimensional transforms, a transform along the feature map channel dimension. The one-dimensional discrete cosine transform and its inverse are as follows:
one-dimensional discrete cosine transform:

$$F(u)=c(u)\sum_{x=0}^{N-1} f(x)\cos\frac{(2x+1)u\pi}{2N}$$

one-dimensional inverse discrete cosine transform:

$$f(x)=\sum_{u=0}^{N-1} c(u)\,F(u)\cos\frac{(2x+1)u\pi}{2N}$$

wherein,

$$c(u)=\begin{cases}\sqrt{1/N},&u=0\\\sqrt{2/N},&u\neq 0\end{cases}$$
where N is the total number of original signals, f(x) is the x-th original signal, F(u) is the frequency signal after the discrete cosine transform, u is the frequency coefficient, and c(u) is the compensation coefficient.
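For reference, this 1D transform pair (with the c(u) compensation built in) matches SciPy's orthonormal DCT-II, so the round trip can be checked numerically:

```python
import numpy as np
from scipy.fft import dct, idct

f = np.random.rand(8)                  # N = 8 original signals f(x)
F = dct(f, type=2, norm='ortho')       # forward 1D DCT with the c(u) compensation coefficients
f_rec = idct(F, type=2, norm='ortho')  # inverse 1D DCT converts back to the original signal
assert np.allclose(f, f_rec)           # the transform pair is lossless
```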
According to the formulas of one-dimensional discrete cosine transform and one-dimensional inverse discrete cosine transform, the formulas of three-dimensional discrete cosine transform (3D DCT) and three-dimensional inverse discrete cosine transform (3D IDCT) can be generalized as follows:
$$F(u,v,w)=c(u)\,c(v)\,c(w)\sum_{x=0}^{N-1}\sum_{y=0}^{M-1}\sum_{z=0}^{L-1} f(x,y,z)\cos\frac{(2x+1)u\pi}{2N}\cos\frac{(2y+1)v\pi}{2M}\cos\frac{(2z+1)w\pi}{2L}$$

(three-dimensional discrete cosine transform);

$$f(x,y,z)=\sum_{u=0}^{N-1}\sum_{v=0}^{M-1}\sum_{w=0}^{L-1} c(u)\,c(v)\,c(w)\,F(u,v,w)\cos\frac{(2x+1)u\pi}{2N}\cos\frac{(2y+1)v\pi}{2M}\cos\frac{(2z+1)w\pi}{2L}$$

(three-dimensional inverse discrete cosine transform);

wherein,

$$c(u)=\begin{cases}\sqrt{1/N},&u=0\\\sqrt{2/N},&u\neq 0\end{cases}\qquad c(v)=\begin{cases}\sqrt{1/M},&v=0\\\sqrt{2/M},&v\neq 0\end{cases}\qquad c(w)=\begin{cases}\sqrt{1/L},&w=0\\\sqrt{2/L},&w\neq 0\end{cases}$$
where N is the number of columns of the first feature map, M is the number of rows, L is the number of channels, f(x, y, z) is the feature value at the y-th row and x-th column of the z-th channel, F(u, v, w) is the corresponding frequency feature after the discrete cosine transform, and c(u), c(v) and c(w) are the corresponding compensation coefficients.
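Because the 3D transform is separable, it can be realized by applying the 1D transform along columns, rows, and channels in turn; a minimal sketch using SciPy's n-dimensional orthonormal DCT:

```python
import numpy as np
from scipy.fft import dctn, idctn

feat = np.random.rand(64, 32, 32)          # first feature map: L channels, M rows, N columns
freq = dctn(feat, type=2, norm='ortho')    # 3D DCT: spatial domain -> frequency domain F(u, v, w)
rec = idctn(freq, type=2, norm='ortho')    # 3D IDCT: frequency domain -> spatial domain f(x, y, z)
assert np.allclose(feat, rec)              # no information is lost by the round trip
```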
In this implementation, a front-end network is used to extract shallow features of the image to obtain the first feature map, which is then input into the frequency feature pyramid model for processing. Referring to FIG. 2, T denotes the three-dimensional discrete cosine transform (3D DCT) and three-dimensional inverse discrete cosine transform (3D IDCT) operations, C denotes a convolution operation, W denotes the softmax operation that generates the weight matrix of the attention mechanism, and Concat denotes concatenation along the channel dimension. In this embodiment, after receiving the first feature map, the frequency feature pyramid model converts it from the spatial domain to the frequency domain through the three-dimensional discrete cosine transform and extracts images at several different frequencies in the frequency domain. In this embodiment, 4 images at frequencies of 1/4, 1/16, 1/64 and 1/256 are taken and converted into 4 second feature maps of different scales through the three-dimensional inverse discrete cosine transform, corresponding to the four parallel rows of feature maps in FIG. 2. Multi-scale features are further extracted through convolution, the feature maps at the 4 scales are fused together by a concatenation operation, and a softmax function is then used to generate the weight matrix. Finally, the weight matrix is enhanced through the attention mechanism, specifically by pixel-wise multiplication and addition with the first feature map to extract high-level semantic features, and a back-end network generates a high-quality image target density prediction map.
Unlike a conventional multi-scale feature pyramid, this embodiment represents different scales using different frequencies of the image, and generates the frequency multi-scale feature pyramid model using the three-dimensional discrete cosine transform (3D DCT) and three-dimensional inverse discrete cosine transform (3D IDCT). Because the feature map does not need to be scaled when the frequency feature pyramid is constructed, the resulting feature maps are guaranteed not to lose excessive detail. After the discrete cosine transform, the image is converted from the spatial domain to the frequency domain, and the inverse discrete cosine transform converts it back from the frequency domain to the spatial domain; different frequencies in the frequency domain correspond to images in the spatial domain without any scaling, so applying the inverse transform to different frequencies yields multi-scale images of the same size as the original. In this embodiment, the frequency feature pyramid model converts the input first feature map from the spatial domain to the frequency domain, then converts 4 feature maps at frequencies of 1/4, 1/16, 1/64 and 1/256 back to the spatial domain to obtain 4 second feature maps of different scales; performs convolution on the 4 second feature maps separately to obtain 4 third feature maps of different scales; fuses the 4 third feature maps to obtain a fourth feature map; generates a weight matrix from the fourth feature map through a softmax function; and enhances the weight matrix through the attention mechanism to generate the image target density prediction map.
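A sketch of the band extraction under one possible reading of the stated frequencies: since 1/4 = (1/2)², 1/16 = (1/4)², and so on, each band keeps a low-frequency corner whose side is 1/2, 1/4, 1/8 or 1/16 of each spatial axis. The patent does not spell out the masking rule, so this interpretation is an assumption:

```python
import numpy as np
from scipy.fft import dctn, idctn

def frequency_pyramid(feat, fractions=(1/4, 1/16, 1/64, 1/256)):
    """Return one feature map per frequency band, each the same size as the input."""
    L, M, N = feat.shape
    freq = dctn(feat, type=2, norm='ortho')            # spatial -> frequency domain
    outputs = []
    for frac in fractions:
        side = np.sqrt(frac)                           # e.g. 1/4 keeps half of each spatial axis
        mask = np.zeros_like(freq)
        mask[:, :max(1, int(M * side)), :max(1, int(N * side))] = 1.0
        outputs.append(idctn(freq * mask, type=2, norm='ortho'))  # band -> spatial domain
    return outputs

bands = frequency_pyramid(np.random.rand(64, 32, 32))  # 4 second feature maps, all 64x32x32
```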
In this embodiment, the attention mechanism may be defined as:

$$F_{i,c}(x)=(1+H_{i,c}(x))\times G_{i,c}(x)$$

where G(x) is the input of the frequency feature pyramid model, H(x) is the weight matrix generated by the softmax function, whose range is [0, 1], F(x) is the feature after multi-scale information enhancement, i indexes the i-th feature map channel, and c denotes the point at position c on the feature map.
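A minimal sketch of this enhancement (the tensor shapes here are assumptions; H(x) and G(x) must be broadcastable against each other):

```python
import torch

def attention_enhance(g, h):
    """F(x) = (1 + H(x)) * G(x): H re-weights G while the residual 1 preserves the original features."""
    return (1.0 + h) * g

g = torch.randn(1, 64, 32, 32)                          # G(x): input of the pyramid model
h = torch.softmax(torch.randn(1, 64, 32, 32), dim=1)    # H(x): softmax weights in [0, 1]
f = attention_enhance(g, h)                             # F(x): multi-scale-enhanced features
```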
This embodiment also designs a new loss function, mainly because most existing methods adopt MSE as the loss function, yet MSE is only a pixel-level loss that measures the global error. This embodiment therefore designs a loss function that keeps the global and local errors consistent, called the global-local consistency loss function, in the following form:
$$L(\theta)=\frac{1}{N}\sum_{i=1}^{N}\Big[\log\cosh\big(F(X,\theta)_i-Y_i\big)+\big(1-GMS(i)\big)\Big]$$

where Y is the actual density map, X is the input image, θ is the parameter set of the frequency feature pyramid model, F(X, θ) denotes the output of the frequency feature pyramid model, GMS(i) is the gradient magnitude similarity between the prediction map and the actual map at point i, and N is the total number of pixels in the input image.
The Log-Cosh error is a loss function similar to MSE, but it is more robust when outliers occur and performs better when the target density varies greatly across samples from different targets and scenes. However, it only reflects the global error and does not consider the local error, so the local error is constrained using the gradient magnitude similarity GMS(i), in the following form:
$$GMS(i)=\frac{2\,m_{Y_p}(i)\,m_Y(i)+c}{m_{Y_p}(i)^2+m_Y(i)^2+c}$$

wherein,

$$m_{Y_p}(i)=\sqrt{\big((Y_p\circledast h_x)(i)\big)^2+\big((Y_p\circledast h_y)(i)\big)^2},\qquad m_Y(i)=\sqrt{\big((Y\circledast h_x)(i)\big)^2+\big((Y\circledast h_y)(i)\big)^2}$$

where c is a positive constant, Y_p is the predicted density map, Y is the actual density map, m_{Y_p}(i) is the gradient magnitude of the predicted density map at point i, m_Y(i) is the gradient magnitude of the actual density map at point i, GMS(i) is the gradient magnitude similarity of the predicted and actual density maps at point i, ⊛ denotes the convolution operation, h_x is the Prewitt operator in the horizontal direction, and h_y is the Prewitt operator in the vertical direction. In this embodiment, the Prewitt operators h_x and h_y are defined as:

$$h_x=\frac{1}{3}\begin{bmatrix}1&0&-1\\1&0&-1\\1&0&-1\end{bmatrix},\qquad h_y=\frac{1}{3}\begin{bmatrix}1&1&1\\0&0&0\\-1&-1&-1\end{bmatrix}$$
the global-local consistency loss function fully considers the consistency of local errors and global errors, and has better robustness when outliers appear in samples.
In this embodiment, referring to FIG. 3, the specific network parameter settings are described in terms of inputs and outputs. As shown in FIG. 3, the network consists of three parts: part a is the front-end network, part b is the back-end network, and part c is the frequency feature pyramid model. In this embodiment, the input of part a is the original RGB crowd image; the output of part a, the input and output of part b, and the input of part c are all intermediate feature maps; and the output of part c is the crowd distribution density map finally output by the network. In the front-end network of part a, Conv3-64-1 denotes a 3x3 convolution kernel with 64 channels and a dilation factor of 1, Conv3-128-1 denotes a 3x3 convolution kernel with 128 channels and a dilation factor of 1, and Conv3-256-1 denotes a 3x3 convolution kernel with 256 channels and a dilation factor of 1; Max Pooling denotes the maximum pooling operation. Similarly, in the back-end network of part b, Conv3-512-2 denotes a 3x3 convolution kernel with 512 channels and a dilation factor of 2, Conv3-256-2 a 3x3 kernel with 256 channels and dilation factor 2, Conv3-128-2 a 3x3 kernel with 128 channels and dilation factor 2, Conv3-64-2 a 3x3 kernel with 64 channels and dilation factor 2, and Conv1-1-1 a 1x1 kernel with 1 channel and dilation factor 1. In the frequency feature pyramid model of part c, DCT3D-1 denotes the three-dimensional discrete cosine transform and inverse transform with frequency coefficient 1, DCT3D-16 with frequency coefficient 16, DCT3D-32 with frequency coefficient 32, and DCT3D-64 with frequency coefficient 64. Softmax compresses the features of the different scales to between 0 and 1; the result is then multiplied with the input of part c and added back to it, which enhances the features at the different scales and constitutes the attention mechanism described above.
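A sketch of how the Conv3-&lt;channels&gt;-&lt;dilation&gt; notation of parts a and b might translate into layers (the padding choices and the back-end's 512 input channels are assumptions; FIG. 3 itself is not reproduced here):

```python
import torch.nn as nn

def conv3(in_ch, out_ch, dilation=1):
    # "Conv3-64-1" = 3x3 kernel, 64 output channels, dilation factor 1; padding keeps the size
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=dilation, dilation=dilation),
        nn.ReLU(inplace=True),
    )

# part b: back-end with dilation factor 2, ending in a 1x1 output layer (Conv1-1-1)
back_end = nn.Sequential(
    conv3(512, 512, dilation=2),
    conv3(512, 256, dilation=2),
    conv3(256, 128, dilation=2),
    conv3(128, 64, dilation=2),
    nn.Conv2d(64, 1, kernel_size=1),   # single-channel density map
)
```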
During training, the network is initialized with the parameters of the first 10 layers of VGG16, and the remaining layers are randomly initialized from a Gaussian distribution with mean 0 and standard deviation 0.01. The learning rate is set to 5e-6, the optimization algorithm is Adam with an initial momentum of 0.95 decaying at a rate of 5e-4, and the loss function is the global-local consistency loss function.
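A sketch of this training setup in PyTorch; reading "momentum 0.95" as Adam's first-moment coefficient and the 5e-4 decay as weight decay is an interpretation, and `model` here stands for the assembled network:

```python
import torch

def init_weights(m):
    # layers outside the VGG16-initialized front-end: Gaussian with mean 0, std 0.01
    if isinstance(m, torch.nn.Conv2d):
        torch.nn.init.normal_(m.weight, mean=0.0, std=0.01)
        if m.bias is not None:
            torch.nn.init.zeros_(m.bias)

model = back_end                    # placeholder: the full network would also include parts a and c
model.apply(init_weights)
optimizer = torch.optim.Adam(model.parameters(),
                             lr=5e-6,                 # learning rate from the text
                             betas=(0.95, 0.999),     # "momentum initial value 0.95"
                             weight_decay=5e-4)       # "decays at a rate of 5e-4"
```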
In summary, the method for predicting the density of the target in the image according to the embodiment of the present invention has the following advantages:
according to the embodiment of the invention, the relation between different feature diagram channels is considered, the frequency feature pyramid is constructed by adopting three-dimensional discrete cosine transform (3DDCT) and three-dimensional inverse discrete cosine transform (3D IDCT), multi-scale frequency information can be extracted, and the feature diagram does not need to be scaled in the feature extraction process, so that the obtained feature diagram can be ensured not to lose excessive detail information; frequency multi-scale features are further fused and enhanced through an attention mechanism, so that a high-quality density prediction graph can be finally generated; meanwhile, the designed loss function fully considers the consistency of the local error and the global error, so that better robustness can be achieved when outliers appear in the prediction process.
The present embodiments also include an apparatus for density prediction of objects in an image, which may include a processor and a memory, wherein:
the memory is used for storing program instructions;
the processor is used for reading the program instructions in the memory and executing the method for density prediction of the target in the image according to the embodiment.
The memory may also be provided separately and used for storing a computer program corresponding to the method of density prediction of objects in an image. When the memory is connected to the processor, the stored computer program is read and executed by the processor, thereby implementing the method for predicting the density of targets in an image and achieving the technical effects described in the embodiments.
The present embodiment also includes a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the method for density prediction of an object in an image as shown in the embodiment.
It should be noted that, unless otherwise specified, when a feature is referred to as being "fixed" or "connected" to another feature, it may be directly fixed or connected to the other feature or indirectly fixed or connected to the other feature. Furthermore, the descriptions of upper, lower, left, right, etc. used in the present disclosure are only relative to the mutual positional relationship of the constituent parts of the present disclosure in the drawings. As used in this disclosure, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. In addition, unless defined otherwise, all technical and scientific terms used in this example have the same meaning as commonly understood by one of ordinary skill in the art. The terminology used in the description of the embodiments herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this embodiment, the term "and/or" includes any combination of one or more of the associated listed items.
It will be understood that, although the terms first, second, third, fourth, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element of the same type from another. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of the present disclosure. The use of any and all examples, or exemplary language ("e.g.," such as "or the like") provided with this embodiment is intended merely to better illuminate embodiments of the invention and does not pose a limitation on the scope of the invention unless otherwise claimed.
It should be recognized that embodiments of the present invention can be realized and implemented by computer hardware, a combination of hardware and software, or by computer instructions stored in a non-transitory computer readable memory. The methods may be implemented in a computer program using standard programming techniques, including a non-transitory computer-readable storage medium configured with the computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner, according to the methods and figures described in the detailed description. Each program may be implemented in a high level procedural or object oriented programming language to communicate with a computer system. However, the program(s) can be implemented in assembly or machine language, if desired. In any case, the language may be a compiled or interpreted language. Furthermore, the program can be run on a programmed application specific integrated circuit for this purpose.
Further, operations of processes described in this embodiment can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The processes described in this embodiment (or variations and/or combinations thereof) may be performed under the control of one or more computer systems configured with executable instructions, and may be implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications) collectively executed on one or more processors, by hardware, or combinations thereof. The computer program includes a plurality of instructions executable by one or more processors.
Further, the method may be implemented in any type of computing platform operatively connected to a suitable interface, including but not limited to a personal computer, mini computer, mainframe, workstation, networked or distributed computing environment, separate or integrated computer platform, or in communication with a charged particle tool or other imaging device, and the like. Aspects of the invention may be embodied in machine-readable code stored on a non-transitory storage medium or device, whether removable or integrated into a computing platform, such as a hard disk, optically read and/or write storage medium, RAM, ROM, or the like, such that it may be read by a programmable computer, which when read by the storage medium or device, is operative to configure and operate the computer to perform the procedures described herein. Further, the machine-readable code, or portions thereof, may be transmitted over a wired or wireless network. The invention described in this embodiment includes these and other different types of non-transitory computer-readable storage media when such media include instructions or programs that implement the steps described above in conjunction with a microprocessor or other data processor. The invention also includes the computer itself when programmed according to the methods and techniques described herein.
A computer program can be applied to input data to perform the functions described in the present embodiment to convert the input data to generate output data that is stored to a non-volatile memory. The output information may also be applied to one or more output devices, such as a display. In a preferred embodiment of the invention, the transformed data represents physical and tangible objects, including particular visual depictions of physical and tangible objects produced on a display.
The above description is only a preferred embodiment of the present invention, and the present invention is not limited to the above embodiment, and any modifications, equivalent substitutions, improvements, etc. within the spirit and principle of the present invention should be included in the protection scope of the present invention as long as the technical effects of the present invention are achieved by the same means. The invention is capable of other modifications and variations in its technical solution and/or its implementation, within the scope of protection of the invention.

Claims (10)

1. A method of density prediction for an object in an image, comprising:
extracting shallow features of the image to obtain a first feature map;
processing the first feature map by using a frequency feature pyramid model to obtain a plurality of second feature maps with different scales;
performing convolution processing on the plurality of second feature maps with different scales respectively to obtain a plurality of third feature maps;
fusing the plurality of third feature maps to obtain a fourth feature map;
generating a weight matrix through a softmax function according to the fourth feature map;
and enhancing the weight matrix through an attention mechanism to generate an image target density prediction map.
2. The method of claim 1, wherein the pyramid model of frequency features is constructed by three-dimensional discrete cosine transform and three-dimensional inverse discrete cosine transform.
3. The method according to claim 2, wherein the step of processing the first feature map by using the pyramid model of frequency features includes:
converting the first feature map from a spatial domain to a frequency domain by a three-dimensional discrete cosine transform;
extracting images of a plurality of different frequencies in a frequency domain;
and converting the images with different frequencies into a plurality of second feature maps with different scales through three-dimensional inverse discrete cosine transform.
4. The method of claim 2, wherein the three-dimensional discrete cosine transform and the three-dimensional inverse discrete cosine transform extend the transform performed along the column and row directions of the first feature map with a further transform along the channel dimension of the first feature map; their formulas are as follows:
$$F(u,v,w)=c(u)\,c(v)\,c(w)\sum_{x=0}^{N-1}\sum_{y=0}^{M-1}\sum_{z=0}^{L-1} f(x,y,z)\cos\frac{(2x+1)u\pi}{2N}\cos\frac{(2y+1)v\pi}{2M}\cos\frac{(2z+1)w\pi}{2L}$$

(three-dimensional discrete cosine transform);

$$f(x,y,z)=\sum_{u=0}^{N-1}\sum_{v=0}^{M-1}\sum_{w=0}^{L-1} c(u)\,c(v)\,c(w)\,F(u,v,w)\cos\frac{(2x+1)u\pi}{2N}\cos\frac{(2y+1)v\pi}{2M}\cos\frac{(2z+1)w\pi}{2L}$$

(three-dimensional inverse discrete cosine transform);

wherein,

$$c(u)=\begin{cases}\sqrt{1/N},&u=0\\\sqrt{2/N},&u\neq 0\end{cases}\qquad c(v)=\begin{cases}\sqrt{1/M},&v=0\\\sqrt{2/M},&v\neq 0\end{cases}\qquad c(w)=\begin{cases}\sqrt{1/L},&w=0\\\sqrt{2/L},&w\neq 0\end{cases}$$
where N is the number of columns of the first feature map, M is the number of rows, L is the number of channels, f(x, y, z) is the feature value at the y-th row and x-th column of the z-th channel, F(u, v, w) is the corresponding frequency feature after the discrete cosine transform, and c(u), c(v) and c(w) are the corresponding compensation coefficients.
5. The method of claim 1, wherein the attention mechanism enhances the weight matrix according to the following formula:

$$F_{i,c}(x)=(1+H_{i,c}(x))\times G_{i,c}(x)$$

where G(x) is the input of the frequency feature pyramid model, H(x) is the weight matrix generated by the softmax function, whose range is [0, 1], F(x) is the feature after multi-scale information enhancement, i indexes the i-th feature map channel, and c denotes the point at position c on the feature map.
6. The method of claim 5, further comprising training the frequency feature pyramid model, including:
constructing a training set, wherein the training set is composed of different feature maps;
inputting the training set into the frequency feature pyramid model and predicting the image target density;
calculating the difference between the predicted value and the true value by using a loss function;
and minimizing the loss function.
7. The method of claim 6, wherein the loss function is:

$$L(\theta)=\frac{1}{N}\sum_{i=1}^{N}\Big[\log\cosh\big(F(X,\theta)_i-Y_i\big)+\big(1-GMS(i)\big)\Big]$$

where Y is the actual density map, X is the input image, θ is the parameter set of the frequency feature pyramid model, F(X, θ) denotes the output of the frequency feature pyramid model, GMS(i) is the gradient magnitude similarity between the prediction map and the actual map at point i, and N is the total number of pixels in the input image.
8. The method of claim 7, wherein the gradient magnitude similarity is computed by the following formula:

$$GMS(i)=\frac{2\,m_{Y_p}(i)\,m_Y(i)+c}{m_{Y_p}(i)^2+m_Y(i)^2+c}$$

wherein,

$$m_{Y_p}(i)=\sqrt{\big((Y_p\circledast h_x)(i)\big)^2+\big((Y_p\circledast h_y)(i)\big)^2},\qquad m_Y(i)=\sqrt{\big((Y\circledast h_x)(i)\big)^2+\big((Y\circledast h_y)(i)\big)^2}$$

where c is a positive constant, Y_p is the predicted density map, Y is the actual density map, m_{Y_p}(i) is the gradient magnitude of the predicted density map at point i, m_Y(i) is the gradient magnitude of the actual density map at point i, GMS(i) is the gradient magnitude similarity of the predicted and actual density maps at point i, ⊛ denotes the convolution operation, h_x is the Prewitt operator in the horizontal direction, and h_y is the Prewitt operator in the vertical direction.
9. An apparatus for density prediction of an object in an image, comprising a processor and a memory, wherein the memory is configured to store program instructions;
the processor is used for reading the program instructions in the memory and executing the method for predicting the density of the target in the image according to any one of claims 1 to 8 according to the program instructions in the memory.
10. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the method of density prediction of an object in an image as claimed in any one of claims 1 to 8.
CN202010074908.2A 2020-01-22 2020-01-22 Method, device and medium for predicting density of target in image Active CN111310805B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010074908.2A CN111310805B (en) 2020-01-22 2020-01-22 Method, device and medium for predicting density of target in image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010074908.2A CN111310805B (en) 2020-01-22 2020-01-22 Method, device and medium for predicting density of target in image

Publications (2)

Publication Number Publication Date
CN111310805A true CN111310805A (en) 2020-06-19
CN111310805B CN111310805B (en) 2023-05-30

Family

ID=71161613

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010074908.2A Active CN111310805B (en) 2020-01-22 2020-01-22 Method, device and medium for predicting density of target in image

Country Status (1)

Country Link
CN (1) CN111310805B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113643261A (en) * 2021-08-13 2021-11-12 江南大学 Lung disease diagnosis method based on frequency attention network

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106779043A (en) * 2016-12-28 2017-05-31 南京沃顿物联网科技有限公司 A kind of method of counting based on number of people detection
US10043113B1 (en) * 2017-10-04 2018-08-07 StradVision, Inc. Method and device for generating feature maps by using feature upsampling networks
CN108921105A (en) * 2018-07-06 2018-11-30 北京京东金融科技控股有限公司 Identify the method, apparatus and computer readable storage medium of destination number
CN109101930A (en) * 2018-08-18 2018-12-28 华中科技大学 A kind of people counting method and system
CN110188685A (en) * 2019-05-30 2019-08-30 燕山大学 A kind of object count method and system based on the multiple dimensioned cascade network of double attentions
CN110263676A (en) * 2019-06-03 2019-09-20 上海眼控科技股份有限公司 A method of for generating high quality crowd density figure

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106779043A (en) * 2016-12-28 2017-05-31 南京沃顿物联网科技有限公司 A kind of method of counting based on number of people detection
US10043113B1 (en) * 2017-10-04 2018-08-07 StradVision, Inc. Method and device for generating feature maps by using feature upsampling networks
CN108921105A (en) * 2018-07-06 2018-11-30 北京京东金融科技控股有限公司 Identify the method, apparatus and computer readable storage medium of destination number
CN109101930A (en) * 2018-08-18 2018-12-28 华中科技大学 A kind of people counting method and system
CN110188685A (en) * 2019-05-30 2019-08-30 燕山大学 A kind of object count method and system based on the multiple dimensioned cascade network of double attentions
CN110263676A (en) * 2019-06-03 2019-09-20 上海眼控科技股份有限公司 A method of for generating high quality crowd density figure

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113643261A (en) * 2021-08-13 2021-11-12 江南大学 Lung disease diagnosis method based on frequency attention network

Also Published As

Publication number Publication date
CN111310805B (en) 2023-05-30

Similar Documents

Publication Publication Date Title
Deschaud et al. A fast and accurate plane detection algorithm for large noisy point clouds using filtered normals and voxel growing
US10885659B2 (en) Object pose estimating method and apparatus
Yang et al. Color-guided depth recovery from RGB-D data using an adaptive autoregressive model
CN107369166B (en) Target tracking method and system based on multi-resolution neural network
Wang et al. Land cover change detection at subpixel resolution with a Hopfield neural network
CN113168510A (en) Segmenting objects a priori by refining shape
Yang et al. A multi-task Faster R-CNN method for 3D vehicle detection based on a single image
Wang et al. Fast subpixel mapping algorithms for subpixel resolution change detection
CN104835130A (en) Multi-exposure image fusion method
US20220147732A1 (en) Object recognition method and system, and readable storage medium
Song et al. Extraction and reconstruction of curved surface buildings by contour clustering using airborne LiDAR data
Chen et al. Saliency-directed image interpolation using particle swarm optimization
Wang et al. Spatiotemporal subpixel mapping of time-series images
Pérez-Benito et al. Smoothing vs. sharpening of colour images: Together or separated
CN116934907A (en) Image generation method, device and storage medium
CN110827320A (en) Target tracking method and device based on time sequence prediction
CN103324753A (en) Image retrieval method based on symbiotic sparse histogram
CN113158970B (en) Action identification method and system based on fast and slow dual-flow graph convolutional neural network
CN111310805B (en) Method, device and medium for predicting density of target in image
Komatsu et al. Octave deep plane-sweeping network: reducing spatial redundancy for learning-based plane-sweeping stereo
CN114462486A (en) Training method of image processing model, image processing method and related device
CN111428809B (en) Crowd counting method based on spatial information fusion and convolutional neural network
Fernández-Caballero et al. Dynamic stereoscopic selective visual attention (DSSVA): Integrating motion and shape with depth in video segmentation
CN110136185B (en) Monocular depth estimation method and system
KR20210058638A (en) Apparatus and method for image processing

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Liang Yanyan

Inventor after: Yu Xiaoyuan

Inventor after: Lin Xuxin

Inventor after: Yu Chuntao

Inventor after: Yang Linlin

Inventor before: Liang Yanyan

Inventor before: Yu Xiaoyuan

Inventor before: Lin Xuxin

CB03 Change of inventor or designer information
CB02 Change of applicant information
CB02 Change of applicant information

Address after: Tower C, 7 / F, Jinlong center, 105 xianxinghai Road, new port, Macau, China

Applicant after: China Energy International Development Investment Group Co.,Ltd.

Address before: Tower C, 7 / F, Jinlong center, 105 xianxinghai Road, new port, Macau, China

Applicant before: China Energy International Construction Investment Group Co.,Ltd.

GR01 Patent grant
GR01 Patent grant
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20230517

Address after: A18, Jinlong Center, 105 Xianxinghai Road, New Port, Macau, China

Applicant after: China Energy International High tech Research Institute Co.,Ltd.

Address before: Tower C, 7 / F, Jinlong center, 105 xianxinghai Road, new port, Macau, China

Applicant before: China Energy International Development Investment Group Co.,Ltd.

TR01 Transfer of patent right

Effective date of registration: 20240412

Address after: Room 4202, Building 2, No. 522 Duhui Road, Hengqin New District, Zhuhai City, Guangdong Province

Patentee after: Boyan Technology (Zhuhai) Co.,Ltd.

Country or region after: China

Address before: A18, Jinlong Center, 105 Xianxinghai Road, New Port, Macau, China

Patentee before: China Energy International High tech Research Institute Co.,Ltd.

Country or region before: China

TR01 Transfer of patent right