WO2021103187A1 - Image processing method and apparatus, processor, electronic device, and storage medium

Image processing method and apparatus, processor, electronic device, and storage medium

Info

Publication number
WO2021103187A1
WO2021103187A1 (PCT/CN2019/125297)
Authority
WO
WIPO (PCT)
Prior art keywords
image
convolution kernel
self-attention
processed
feature
Prior art date
Application number
PCT/CN2019/125297
Other languages
English (en)
French (fr)
Inventor
陈航
朱烽
Original Assignee
深圳市商汤科技有限公司
Priority date
Filing date
Publication date
Application filed by 深圳市商汤科技有限公司
Priority to SG11202106680UA
Priority to KR1020217013985A
Priority to JP2021521482A
Publication of WO2021103187A1
Priority to US17/348,878

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/15 Correlation function computation including computation of convolution operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0635 Risk analysis of enterprise or organisation activities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/26 Government or public services
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53 Recognition of crowd images, e.g. recognition of crowd congestion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Definitions

  • This application relates to the field of image processing technology, and in particular to an image processing method and device, processor, electronic equipment, and storage medium.
  • Traditional methods based on deep learning can process images of public places by extracting feature information from the images, determining a crowd density image corresponding to the image based on that feature information, and then counting the number of people in the image of the public place from the crowd density image.
  • This application provides an image processing method and device, processor, electronic equipment, and storage medium.
  • In a first aspect, an image processing method includes: acquiring an image to be processed, a first convolution kernel, and a second convolution kernel, where the receptive field of the first convolution kernel is different from the receptive field of the second convolution kernel; using the first convolution kernel to perform convolution processing on the image to be processed to obtain a first feature image; using the second convolution kernel to perform convolution processing on the image to be processed to obtain a second feature image; and performing fusion processing on the first feature image and the second feature image to obtain a first crowd density image.
  • Using the first convolution kernel and the second convolution kernel with different receptive fields to perform convolution processing on the image to be processed extracts information describing the content of the image to be processed at different scales, yielding the first feature image and the second feature image respectively. Fusing the first feature image and the second feature image then uses the information at different scales, thereby improving the accuracy of the obtained crowd density image corresponding to the image to be processed (see the sketch below).
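  • As a rough illustration of the above, the following is a minimal sketch, assuming PyTorch; the kernel sizes, channel count, and the simple additive fusion are illustrative assumptions rather than the exact configuration of this application:

```python
import torch
import torch.nn as nn

class TwoScaleDensityHead(nn.Module):
    """Sketch: two convolution kernels with different receptive fields, then fusion."""
    def __init__(self, channels: int = 64):
        super().__init__()
        # Same 3*3 size, different dilation rates -> different receptive fields.
        self.conv_small_rf = nn.Conv2d(channels, channels, 3, padding=1, dilation=1)
        self.conv_large_rf = nn.Conv2d(channels, channels, 3, padding=2, dilation=2)
        self.to_density = nn.Conv2d(channels, 1, kernel_size=1)  # 1*1 output head

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f1 = self.conv_small_rf(x)      # first feature image
        f2 = self.conv_large_rf(x)      # second feature image
        fused = f1 + f2                 # simplest fusion: add same-position pixel values
        return self.to_density(fused)   # first crowd density image

# Usage: features of shape (N, 64, H, W) -> density map of shape (N, 1, H, W).
density = TwoScaleDensityHead()(torch.randn(1, 64, 32, 32))
```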
  • In some embodiments, before the fusion processing is performed on the first feature image and the second feature image to obtain the first crowd density image, the method further includes: performing a first feature extraction process and a second feature extraction process on the image to be processed to extract information of the image to be processed at different scales, obtaining a first self-attention image and a second self-attention image; determining a first weight of the first feature image based on the first self-attention image; and determining a second weight of the second feature image based on the second self-attention image. Performing the fusion processing on the first feature image and the second feature image based on the first weight and the second weight can improve the accuracy of the obtained first crowd density image.
  • In some embodiments, performing the fusion processing on the first feature image and the second feature image according to the first weight and the second weight to obtain the first crowd density image includes: determining the first weight of the first feature image according to the first self-attention image, and determining the second weight of the second feature image according to the second self-attention image.
  • In some embodiments, determining the first weight and the second weight includes: normalizing the first self-attention image and the second self-attention image to obtain a third self-attention image and a fourth self-attention image; the third self-attention image is used as the first weight, and the fourth self-attention image is used as the second weight.
  • Through the normalization, the sum of the pixel values of the pixels at the same position in the normalized first self-attention image and second self-attention image is 1. Then, by using the normalized images as the first weight and the second weight when fusing the first feature image and the second feature image, convolution processing with different receptive fields is effectively applied to different image regions of the image to be processed, further improving the accuracy of the obtained first crowd density image (see the sketch below).
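  • The weighted fusion just described can be sketched as follows, assuming PyTorch; stacking the two self-attention images and applying a softmax across them follows the normalization described later in this application, while the tensor shapes are illustrative:

```python
import torch

def attention_weighted_fusion(f1, f2, a1, a2):
    """Fuse two feature images using two self-attention images as weights.

    f1, f2: first and second feature images, shape (N, C, H, W).
    a1, a2: first and second self-attention images, same shape.
    The softmax over the stacked attention maps makes the weights at the
    same position sum to 1 (the third and fourth self-attention images).
    """
    weights = torch.softmax(torch.stack([a1, a2], dim=0), dim=0)
    w1, w2 = weights[0], weights[1]   # first weight and second weight
    return w1 * f1 + w2 * f2          # elementwise (dot) product, then add

f1, f2, a1, a2 = (torch.randn(1, 8, 16, 16) for _ in range(4))
fused = attention_weighted_fusion(f1, f2, a1, a2)
```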
  • In some embodiments, the method further includes: before using the first convolution kernel to perform convolution processing on the image to be processed to obtain the first feature image, and using the second convolution kernel to perform convolution processing on the image to be processed to obtain the second feature image, performing a third feature extraction process on the image to be processed to extract feature information of the image to be processed, obtaining a fifth feature image. In this case, the convolution processing using the first and second convolution kernels and the first and second feature extraction processes may be performed on the fifth feature image rather than directly on the image to be processed.
  • In some embodiments, the first convolution kernel and the second convolution kernel are both hole (dilated) convolution kernels; the size of the first convolution kernel is the same as the size of the second convolution kernel, the weights of the first convolution kernel are the same as the weights of the second convolution kernel, and the dilation rate of the first convolution kernel is different from the dilation rate of the second convolution kernel. In this way, the weights of the two convolution kernels can be kept the same while the receptive field of the first convolution kernel differs from the receptive field of the second convolution kernel.
  • Thus, the information contained in the first feature image, obtained by performing convolution processing on the image to be processed with the first convolution kernel, and the information contained in the second feature image, obtained by performing convolution processing with the second convolution kernel, differ only in scale. This allows the information of the image to be processed at different scales to be better used, improving the accuracy of the obtained first crowd density image.
  • In some embodiments, the dilation rate of the first convolution kernel or the second convolution kernel is a reference value.
  • In some embodiments, the method further includes: determining the sum of the pixel values in the first crowd density image to obtain the number of people in the image to be processed. In this way, the number of people in the image to be processed can be determined from the first crowd density image (see the sketch below).
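  • Concretely, the count is simply the sum over the density map; a minimal sketch, assuming the first crowd density image is a tensor:

```python
import torch

# Hypothetical first crowd density image with values in [0, 1].
density = torch.rand(1, 1, 64, 64) * 0.01
num_people = density.sum().item()   # number of people in the image to be processed
print(f"estimated count: {num_people:.1f}")
```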
  • In some embodiments, the method is applied to a crowd counting network, whose training process is described later in this application. The trained crowd counting network is used to process the image to be processed to obtain the crowd density image corresponding to the image to be processed.
  • In some embodiments, before the network loss is obtained based on the difference between the sample image and the second crowd density image, the method further includes: obtaining a real crowd density image of the sample image. Obtaining the network loss then includes: obtaining the network loss based on the difference between the real crowd density image and the second crowd density image. In other words, the real crowd density image of the sample image is used as the supervision data of the crowd counting network, and the network loss of the crowd counting network is determined based on the difference between the real crowd density image and the second crowd density image.
  • In some embodiments, before the sample image is processed through the crowd counting network to obtain the second crowd density image, the method further includes: preprocessing the sample image to obtain at least one preprocessed image. Processing the sample image via the crowd counting network then includes: processing the at least one preprocessed image via the crowd counting network to obtain a third crowd density image, and the network loss is obtained according to the difference between a target image in the at least one preprocessed image and the third crowd density image corresponding to the target image. In other words, before the sample image is input to the crowd counting network, it is preprocessed to obtain at least one preprocessed image, and the at least one preprocessed image is used as training data for the crowd counting network; in this way, the training data set of the crowd counting network is effectively expanded.
  • In some embodiments, the preprocessing includes at least one of: cropping an image of a predetermined size from the sample image, and performing flip processing on the sample image or on the image of the predetermined size.
  • In a second aspect, an image processing device is provided, which includes:
  • An acquiring unit configured to acquire an image to be processed, a first convolution kernel, and a second convolution kernel, where the receptive field of the first convolution kernel is different from the receptive field of the second convolution kernel;
  • the convolution processing unit is configured to use the first convolution kernel to perform convolution processing on the image to be processed to obtain a first feature image, and use the second convolution kernel to perform convolution processing on the image to be processed to obtain a second feature image;
  • the fusion processing unit is configured to perform fusion processing on the first feature image and the second feature image to obtain a first crowd density image.
  • the device further includes:
  • a feature extraction processing unit, configured to: before the fusion processing is performed on the first feature image and the second feature image to obtain the first crowd density image, perform a first feature extraction process on the image to be processed to obtain a first self-attention image, and perform a second feature extraction process on the image to be processed to obtain a second self-attention image; both the first self-attention image and the second self-attention image are used to represent the scale information of the image to be processed, and the scale information represented by the first self-attention image is different from the scale information represented by the second self-attention image;
  • a first determining unit, configured to determine a first weight of the first feature image according to the first self-attention image, and determine a second weight of the second feature image according to the second self-attention image;
  • the fusion processing unit is configured to perform the fusion processing on the first feature image and the second feature image according to the first weight and the second weight;
  • the first determining unit is configured to: normalize the first self-attention image and the second self-attention image to obtain a third self-attention image and a fourth self-attention image; the third self-attention image is used as the first weight, and the fourth self-attention image is used as the second weight.
  • In some embodiments, the feature extraction processing unit is further configured to: before the first convolution kernel is used to perform convolution processing on the image to be processed to obtain the first feature image, and the second convolution kernel is used to perform convolution processing on the image to be processed to obtain the second feature image, perform a third feature extraction process on the image to be processed to obtain a fifth feature image; the convolution processing unit and the feature extraction processing unit may then operate on the fifth feature image.
  • In some embodiments, the first convolution kernel and the second convolution kernel are both hole (dilated) convolution kernels; the size of the first convolution kernel is the same as the size of the second convolution kernel, the weights of the first convolution kernel are the same as the weights of the second convolution kernel, and the dilation rate of the first convolution kernel is different from the dilation rate of the second convolution kernel.
  • In some embodiments, the dilation rate of the first convolution kernel or the second convolution kernel is a reference value.
  • the device further includes: a second determining unit configured to determine the sum of pixel values in the first crowd density image to obtain the number of people in the image to be processed.
  • the image processing method executed by the device is applied to a crowd counting network
  • the device further includes a training unit for training the crowd counting network, and the training process of the crowd counting network includes: obtaining a sample image; processing the sample image through the crowd counting network to obtain a second crowd density image; obtaining a network loss based on the difference between the sample image and the second crowd density image; and adjusting the parameters of the crowd counting network based on the network loss.
  • In some embodiments, the training unit is further configured to: obtain a real crowd density image of the sample image based on an impact function, a Gaussian kernel, and the sample image; and obtain the network loss based on the difference between the real crowd density image and the second crowd density image.
  • In some embodiments, the training unit is further configured to: preprocess the sample image to obtain at least one preprocessed image; and obtain the network loss according to the difference between a target image in the at least one preprocessed image and the third crowd density image corresponding to the target image.
  • In some embodiments, the preprocessing includes at least one of: cropping an image of a predetermined size from the sample image, and performing flip processing on the sample image or on the image of the predetermined size.
  • In a third aspect, a processor is provided, and the processor is configured to execute a method as described in the first aspect and any one of its possible implementation manners.
  • In a fourth aspect, an electronic device is provided, including a processor and a memory connected to each other. The memory is used to store computer program code, and the computer program code includes computer instructions; when the processor executes the computer instructions, the electronic device executes the method of the first aspect and any one of its possible implementation manners.
  • In a fifth aspect, a computer-readable storage medium is provided, which stores a computer program. The computer program includes program instructions that, when executed by a processor of an electronic device, cause the processor to execute the method as described in the first aspect and any one of its possible implementation manners.
  • In a sixth aspect, a computer program product containing instructions is provided; when the computer program product runs on a computer, it causes the computer to execute the method of the first aspect and any one of its possible implementation manners.
  • FIG. 1 is a schematic flowchart of an image processing method provided by an embodiment of this application.
  • FIG. 2a is a schematic diagram of a convolution kernel provided by an embodiment of the application.
  • FIG. 2b is a schematic diagram of the weights of a convolution kernel provided by an embodiment of the application.
  • FIG. 3 is a schematic diagram of elements in the same position provided by an embodiment of the application.
  • FIG. 4 is a schematic diagram of a crowd image provided by an embodiment of this application.
  • FIG. 5 is a schematic flowchart of another image processing method provided by an embodiment of the application.
  • FIG. 6a is a schematic diagram of a hole convolution kernel provided by an embodiment of the application.
  • FIG. 6b is a schematic diagram of another hole convolution kernel provided by an embodiment of the application.
  • FIG. 7 is a schematic diagram of another hole convolution kernel provided by an embodiment of the application.
  • FIG. 8 is a schematic structural diagram of a crowd counting network provided by an embodiment of this application.
  • FIG. 9 is a schematic structural diagram of a scale-aware convolutional layer provided by an embodiment of the application.
  • FIG. 10 is a schematic structural diagram of an image processing device provided by an embodiment of the application.
  • FIG. 11 is a schematic diagram of the hardware structure of an image processing device provided by an embodiment of the application.
  • In an image, the image scale corresponding to a nearby person is large, and the image scale corresponding to a distant person is small. Here, "far" means that the distance between the real person corresponding to the person in the image and the imaging device that captured the image is large, and "near" means that this distance is small.
  • The receptive field is defined as the size of the area on the input image to which a pixel on the feature map output by each layer of a convolutional neural network is mapped. In the embodiments of the present application, the receptive field of a convolution kernel refers to the receptive field of the convolution processing performed on an image using that convolution kernel.
  • the technical solutions provided by the embodiments of the present application can extract the scale information in the image, thereby improving the accuracy of determining the number of people.
  • FIG. 1 is a schematic flowchart of an image processing method provided by Embodiment (1) of the present application.
  • the execution subject of the embodiments of the present application may be terminal hardware such as servers, mobile phones, computers, and tablet computers.
  • the method provided in the embodiments of the present application may also be executed by a processor running computer executable code.
  • the above-mentioned image to be processed may be any image.
  • The image to be processed may contain a human object; it may include only a human face without the torso and limbs (hereinafter, the torso and limbs are referred to as the human body), only the human body without the human face, or only the lower limbs or upper limbs.
  • This application does not limit the area of the human body specifically included in the image to be processed.
  • the image to be processed may contain animals.
  • the image to be processed may include plants. This application does not limit the content contained in the image to be processed.
  • A convolution kernel with one channel exists in the form of an n*n matrix containing n*n elements, each of which has a value; the values of the elements in the matrix are the weights of the convolution kernel.
  • both the first convolution kernel and the second convolution kernel can be convolution kernels of any size.
  • the weight of the first convolution kernel and the weight of the second convolution kernel can be any natural numbers.
  • This application does not limit the size of the first convolution kernel, the size of the second convolution kernel, the weights of the first convolution kernel, or the weights of the second convolution kernel.
  • the method for obtaining the image to be processed may be to receive the image to be processed input by the user through the input component, or may be to receive the image to be processed sent by the terminal.
  • The method for obtaining the first convolution kernel may be to receive the first convolution kernel input by the user through the input component, or to receive the first convolution kernel sent by a terminal. Likewise, the second convolution kernel may be obtained by receiving the second convolution kernel input by the user through the input component, or by receiving the second convolution kernel sent by a terminal.
  • the above-mentioned input components include: a keyboard, a mouse, a touch screen, a touch pad, and an audio input device.
  • the aforementioned terminals include mobile phones, computers, tablets, servers, and so on.
  • both the first feature image and the second feature image contain information for describing the content of the image to be processed, but the scale of the information contained in the first feature image is different from the scale of the information contained in the second feature image.
  • the crowd density image includes crowd density information.
  • the pixel value of each pixel in the crowd density image represents the number of people at that pixel. For example, if the pixel value of pixel A in the crowd density image is 0.05, then there are 0.05 people at pixel A.
  • The image area covered by a person contains at least one pixel. When the image area covered by a person is one pixel, the pixel value of that pixel is 1; when the image area covered by a person contains at least two pixels, the sum of the pixel values of those pixels is 1. Therefore, the pixel values in the crowd density image are greater than or equal to 0 and less than or equal to 1.
  • the above-mentioned first crowd density image is a crowd density image corresponding to the image to be processed, and may represent the crowd density distribution in the image to be processed.
  • the size of the first crowd density image is the same as the size of the image to be processed.
  • the size of the image in this embodiment refers to the width and height of the image.
  • the pixel value of the first pixel in the first crowd density image can be used to characterize the number of people at the second pixel in the image to be processed.
  • the position of the first pixel in the first crowd density image is the same as the position of the second pixel in the image to be processed.
  • Pixels at the same position in two images are illustrated in FIG. 3: the position of pixel A11 in image A is the same as the position of pixel B11 in image B, the position of pixel A12 is the same as the position of pixel B12, and likewise for the pairs A13/B13, A21/B21, A22/B22, A23/B23, A31/B31, A32/B32, and A33/B33.
  • In the following, the statement "the position of pixel x in image X is the same as the position of pixel y in image Y" is abbreviated: pixel x is referred to as the pixel in image X at the same position as pixel y, or pixel y is called the pixel in image Y at the same position as pixel x.
  • By performing fusion processing on the first feature image and the second feature image (for example, weighting the pixel values at corresponding positions), the information describing the image content of the image to be processed at different scales can be used to generate the crowd density image corresponding to the image to be processed, that is, the first crowd density image. In this way, the accuracy of the obtained crowd density image is improved, thereby improving the accuracy of the number of people obtained for the image to be processed.
  • This embodiment illustrates using two convolution kernels with different receptive fields (that is, the first convolution kernel and the second convolution kernel) to perform convolution processing on the image to be processed, obtaining information describing the image content at two scales. It is also possible to perform convolution processing on the image to be processed with three or more convolution kernels with different receptive fields to obtain information describing the image content at three or more scales, and to fuse the information at those scales to obtain the crowd density image corresponding to the image to be processed.
  • the number of people in the image to be processed can be obtained by determining the sum of the pixel values of all pixels in the first crowd density image.
  • In this embodiment, the first convolution kernel and the second convolution kernel with different receptive fields are used to perform convolution processing on the image to be processed, extracting information describing the content of the image at different scales and obtaining the first feature image and the second feature image respectively. Through fusion processing of the first feature image and the second feature image, the information at different scales can be used to improve the accuracy of the obtained crowd density image, thereby improving the accuracy of the number of people obtained for the image to be processed.
  • the area of the image area covered by the people in the vicinity is larger than the area of the image area covered by the people in the distance.
  • the person A in FIG. 4 is a close person compared to the person B, and the area of the image area covered by the person A is larger than the area of the image area covered by the person B.
  • the scale of the image area covered by the people in the vicinity is large, and the scale of the image area covered by the people in the distance is small. Therefore, the area of the image area covered by the person is positively correlated with the scale of the image area covered by the person.
  • When the receptive field of the convolution processing matches the scale of the image area covered by a person, the information obtained for that image area is the richest (hereinafter, the receptive field that obtains the richest information of the image area covered by a person is called the best receptive field of the area covered by that person).
  • the scale of the image area covered by the person is positively correlated with the best receptive field of the area covered by the person.
  • Embodiment (1) uses the first convolution kernel and the second convolution kernel with different receptive fields to perform convolution processing on the image to be processed respectively to obtain information describing the content of the image to be processed in different scales.
  • However, the receptive field of the first convolution kernel and the receptive field of the second convolution kernel are fixed, while the scales of different image regions in the image to be processed differ; therefore, using the first convolution kernel and the second convolution kernel to perform convolution processing on the image to be processed cannot achieve the best receptive field for every image area, that is, it cannot obtain the richest information for the different image areas of the image to be processed.
  • For this reason, the embodiments of the present application also provide a method for assigning weights to the first feature image and the second feature image when they are fused, so that image areas of different scales in the image to be processed effectively undergo convolution processing with different receptive fields, obtaining richer information.
  • FIG. 5 is a schematic flowchart of another image processing method provided by Embodiment (2) of the present application.
  • Both the first self-attention image and the second self-attention image are used to represent the scale information of the image to be processed, and the scale information represented by the first self-attention image is different from the scale information represented by the second self-attention image.
  • the feature extraction processing may be convolution processing, pooling processing, or a combination of convolution processing and pooling processing. This application does not limit the implementation of the first feature extraction process and the implementation of the second feature extraction process.
  • the image to be processed is sequentially convolved through multiple layers of convolution layers to implement the first feature extraction process of the image to be processed, and the first self-attention image is obtained.
  • the image to be processed can be sequentially convolved through multiple convolution layers to achieve the second feature extraction process of the image to be processed, and the second self-attention image can be obtained.
  • The image to be processed may also be subjected to a third feature extraction process to extract feature information of the image to be processed, obtaining a fifth feature image; the subsequent processing may then be performed on the fifth feature image.
  • the size of the first self-attention image and the size of the second self-attention image are both the same as the size of the image to be processed.
  • Both the first self-attention image and the second self-attention image can be used to represent the scale information of the image to be processed (that is, the scale of different image regions in the image to be processed), and the scale information represented by the first self-attention image It is different from the scale information represented by the second self-attention image.
  • In the embodiments of the present application, the scale of an image (including the above-mentioned first feature image, second feature image, first self-attention image, second self-attention image, and the third self-attention image mentioned below) matches the receptive field of the convolution kernel used in the feature extraction process (including the first feature extraction process, the second feature extraction process, and the third feature extraction process) applied to the image to be processed.
  • For example, suppose the scale of an image obtained by convolving the image to be processed with a 3*3 convolution kernel is a, and the scale of an image obtained by convolving it with a 5*5 convolution kernel is b. Then the scale of the self-attention image obtained by performing feature extraction processing on the image to be processed with a 3*3 convolution kernel is a (that is, that self-attention image can represent the information of the image to be processed at scale a), and the scale of the feature image obtained by performing feature extraction processing with a 5*5 convolution kernel is b.
  • the first self-attention image represents the information of the image to be processed at scale a
  • the second self-attention image represents the information of the image to be processed at scale b, where the scale a is greater than the scale b.
  • the range of the pixel value of the pixel point in the first self-attention image and the pixel value of the pixel point in the second self-attention image are both: greater than or equal to 0 and less than or equal to 1.
  • The closer the pixel value of a pixel in the first self-attention image (or the second self-attention image) is to 1, the closer the optimal scale of the pixel at the same position in the image to be processed is to the scale represented by that self-attention image, where the optimal scale of a pixel is the scale corresponding to the optimal receptive field of that pixel.
  • Continuing the example above: pixel a and pixel b are two different pixels in the first self-attention image; pixel c is the pixel in the image to be processed at the same position as pixel a in the first self-attention image, and pixel d is the pixel in the image to be processed at the same position as pixel b in the first self-attention image. If the pixel value of pixel a is 0.9 and the pixel value of pixel b is 0.7, then the difference between the optimal scale of pixel c and scale a is smaller than the difference between the optimal scale of pixel d and scale a.
  • the scale represented by the first self-attention image is the same as the scale of the first feature image
  • the scale represented by the second self-attention image is the same as the scale of the second feature image.
  • Therefore, the closer the pixel value of a pixel in the first self-attention image is to 1, the closer the optimal scale of the pixel at the same position in the first feature image is to the scale of the first feature image; similarly, the closer the pixel value of a pixel in the second self-attention image is to 1, the closer the optimal scale of the pixel at the same position in the second feature image is to the scale of the second feature image.
  • the first weight of the first feature image can be determined according to the first self-attention image to adjust the scale of the pixel points in the first feature image, so that the pixel points in the first feature image are closer to the optimal scale.
  • the second weight of the second feature image can be determined according to the second self-attention image to adjust the scale of the pixels in the second feature image, so that the pixels in the second feature image are closer to the optimal scale.
  • In one possible implementation, the first self-attention image and the second self-attention image can be normalized to obtain a third self-attention image corresponding to the first self-attention image and a fourth self-attention image corresponding to the second self-attention image; the third self-attention image is used as the above-mentioned first weight, and the fourth self-attention image is used as the above-mentioned second weight.
  • Through the normalization processing, the sum of the pixel values of the pixels at the same position in the normalized first self-attention image and second self-attention image is made 1. For example, if the position of pixel a in the first self-attention image is the same as the position of pixel b in the second self-attention image, then after normalization the sum of the pixel value of pixel a and the pixel value of pixel b is 1. Equivalently, if the position of pixel c in the third self-attention image is the same as the position of pixel a in the first self-attention image, and the position of pixel d in the fourth self-attention image is the same as the position of pixel b in the second self-attention image, then the sum of the pixel value of pixel c and the pixel value of pixel d is 1.
  • The aforementioned normalization processing can be implemented by inputting the first self-attention image and the second self-attention image to a softmax function. If the first self-attention image and the second self-attention image both contain images of multiple channels, the images of the same channel in the two self-attention images are input to the softmax function together.
  • For example, if the first self-attention image and the second self-attention image both contain images of 2 channels, then when normalizing them, the image of the first channel in the first self-attention image and the image of the first channel in the second self-attention image are input to the softmax function to obtain the image of the first channel in the third self-attention image and the image of the first channel in the fourth self-attention image; the second channel is processed in the same way (see the formulas below).
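  • In formula form, the normalization maps the two self-attention values at each pixel position p to weights that sum to 1; this is an ordinary softmax, written here in our own notation:

```latex
W_1(p)=\frac{e^{A_1(p)}}{e^{A_1(p)}+e^{A_2(p)}},\qquad
W_2(p)=\frac{e^{A_2(p)}}{e^{A_1(p)}+e^{A_2(p)}},\qquad
W_1(p)+W_2(p)=1
```

    where A_1 and A_2 denote the first and second self-attention images, and W_1 and W_2 the resulting third and fourth self-attention images.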
  • the receptive field of the convolution process for obtaining the first feature image is different from the receptive field of the convolution process for obtaining the second feature image.
  • Using the third self-attention image as the first weight of the first feature image and the fourth self-attention image as the second weight of the second feature image, the dot product between the first weight and the first feature image is calculated to obtain a third feature image, and the dot product between the second weight and the second feature image is calculated to obtain a fourth feature image. The third feature image and the fourth feature image can then be fused to obtain the first crowd density image.
  • In this embodiment, the first feature extraction process and the second feature extraction process are performed on the image to be processed to extract information of the image to be processed at different scales, obtaining the first self-attention image and the second self-attention image. The first weight of the first feature image is determined based on the first self-attention image, and the second weight of the second feature image is determined based on the second self-attention image. Performing the fusion processing on the first feature image and the second feature image based on the first weight and the second weight can improve the accuracy of the obtained first crowd density image.
  • When the weights of the first convolution kernel and the second convolution kernel differ, the focus of the feature information extracted by performing convolution processing on the image to be processed with the first convolution kernel is different from the focus of the feature information extracted with the second convolution kernel. For example, convolution processing with the first convolution kernel may focus on extracting the attributes of the persons in the image to be processed (such as clothes color or pants length), while convolution processing with the second convolution kernel may focus on extracting the contour features of the persons (contour features can be used to identify whether the image to be processed contains a person).
  • To address this, the embodiments of the present application also provide a technical solution in which the weights of the first convolution kernel and the weights of the second convolution kernel are the same, so as to reduce the fusion of non-scale information during the fusion processing of the first feature image and the second feature image, improve the effect of scale information fusion, and further improve the accuracy of the obtained first crowd density image.
  • In one possible implementation, the first convolution kernel and the second convolution kernel are both hole (dilated) convolution kernels, the size of the first convolution kernel is the same as the size of the second convolution kernel, the weights of the first convolution kernel are the same as the weights of the second convolution kernel, and the dilation rate of the first convolution kernel is different from the dilation rate of the second convolution kernel.
  • For example, the sizes of the two hole convolution kernels shown in FIG. 6a and FIG. 6b are both 3*3. In these figures, the black areas indicate positions that have parameters, and the white parts indicate positions without parameters (that is, the parameter is 0). The weights of the hole convolution kernel shown in FIG. 6a may be the same as the weights of the hole convolution kernel shown in FIG. 6b.
  • The dilation rate of the hole convolution kernel shown in FIG. 6a is 2, while the dilation rate of the hole convolution kernel shown in FIG. 6b is 1, so the receptive field of the kernel in FIG. 6a differs from the receptive field of the kernel in FIG. 6b; specifically, the receptive field of the hole convolution kernel shown in FIG. 6a (5*5) is larger than that of the hole convolution kernel shown in FIG. 6b (3*3).
  • In this way, the weights of the first convolution kernel and the weights of the second convolution kernel can be set to be the same while the receptive field of the first convolution kernel differs from the receptive field of the second convolution kernel. Concretely, the weights can be made the same by having the first convolution kernel and the second convolution kernel share the same set of weights; this also reduces the number of parameters to be processed when the two kernels are subsequently used to perform convolution processing on the image to be processed.
  • The receptive field of a hole convolution kernel is positively correlated with its dilation rate. When the dilation rate of a hole convolution kernel is 1, its receptive field is the same as that of a conventional convolution kernel of the same size; for example, the dilation rate of the hole convolution kernel shown in FIG. 6b is 1, so its receptive field is the same as the receptive field of a conventional 3*3 convolution kernel (see the formula below).
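  • This relationship can be made precise with the standard receptive-field formula for dilated convolution (our addition, consistent with the figures): for a kernel of size k and dilation rate d, the effective receptive field is

```latex
r = k + (k-1)(d-1)
```

    so a 3*3 kernel with d = 1 gives r = 3 (FIG. 6b), and with d = 2 gives r = 5 (FIG. 6a).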
  • The embodiments of the present application therefore also provide a method of setting the dilation rate of the hole convolution kernel to 0 (that is, a reference value), so that the receptive field of the hole convolution kernel is smaller than that of a conventional convolution kernel of the same size, in order to better extract the information of smaller-scale image areas in the image to be processed.
  • Specifically, for a 3*3 hole convolution kernel with dilation rate d sliding over the image to be processed, with its center at pixel (x, y), the convolution processing can be written as

    $$O(x,y)=\sum_{i=-1}^{1}\sum_{j=-1}^{1} w_{(1+i,1+j)}\, I(x+i\cdot d,\; y+j\cdot d)+b$$

    where x and y are the position of the center pixel of the hole convolution kernel when it slides to a certain pixel of the image to be processed, (x+i·d, y+j·d) are the coordinates of the sampling points in the image to be processed, w_(1+i,1+j) are the weights of the hole convolution kernel, b is the deviation (bias) of the hole convolution kernel, I is the image to be processed, and O is the feature image obtained by performing convolution processing on the image to be processed using the hole convolution kernel.
  • When the dilation rate d is 0, all the sampling points coincide with the center pixel (x, y), and the formula reduces to

    $$O(x,y)=\Big(\sum_{i=-1}^{1}\sum_{j=-1}^{1} w_{(1+i,1+j)}\Big)\, I(x,y)+b=w'_{k}\, I(x,y)+b'_{k}$$

    where w′_k represents the weight of a conventional convolution kernel with a size of 1*1 and b′_k represents the deviation of that conventional 1*1 convolution kernel. That is, a hole convolution kernel with a dilation rate of 0 is equivalent to a conventional 1*1 convolution kernel.
  • FIG. 7 shows a hole convolution kernel with a size of 3*3 and a dilation rate of 0; the black area in the kernel is the position of the weights. It can be seen from FIG. 7 that the receptive field of a hole convolution kernel with a dilation rate of 0 is 1.
  • In this way, when the first convolution kernel is a hole convolution kernel, setting its dilation rate to 0 enables convolution processing of the image to be processed with a small receptive field, better extracting the information of small-scale image areas in the image to be processed.
  • FIG. 8 is a schematic structural diagram of a crowd counting network provided by an embodiment of this application. As shown in Figure 8, the network layers in the crowd counting network are connected in series, including 11 layers of convolutional layers, 9 layers of pooling layers, and 6 layers of scale-aware convolutional layers.
  • The image to be processed is input to the crowd counting network and processed by the first convolutional layer to obtain the image output by the first convolutional layer; the image output by the first convolutional layer is processed by the second convolutional layer, the image output by the second convolutional layer is processed by the first pooling layer to obtain the image output by the first pooling layer, and so on; the image output by the tenth convolutional layer is processed by the first scale-aware convolutional layer to obtain the image output by the first scale-aware convolutional layer, and so on, until the image output by the ninth pooling layer is processed by the eleventh convolutional layer to obtain the output of the crowd counting network.
  • In some embodiments, the size of the convolution kernels in all convolutional layers of the crowd counting network except the eleventh convolutional layer can be 3*3, and the size of the convolution kernels in the eleventh convolutional layer is 1*1. The number of convolution kernels in the first convolutional layer and in the second convolutional layer can both be 64; the number of convolution kernels in the third convolutional layer and in the fourth convolutional layer can both be 128; the number of convolution kernels in the fifth, sixth, seventh, eighth, ninth, and tenth convolutional layers can all be 512; and the number of convolution kernels in the eleventh convolutional layer is 1.
  • the pooling layer in the crowd counting network can be the maximum pooling layer or the average pooling layer, which is not limited in this application.
  • The structural diagram of the scale-aware convolutional layer can be seen in FIG. 9. The scale-aware convolutional layer includes three hole convolution kernels and a self-attention module; the structures of the three hole convolution kernels can be seen in FIG. 6a, FIG. 6b, and FIG. 7, and will not be repeated here. The self-attention module contains 3 parallel convolutional layers.
  • the input image of the scale-aware convolutional layer is processed by the hole convolution kernels of three different receptive fields to obtain the sixth feature image, the seventh feature image, and the eighth feature image, respectively.
  • the input image of the scale-aware convolutional layer is processed by the convolution of the three convolutional layers in the self-attention module to obtain the fifth self-attention image, the sixth self-attention image, and the seventh self-attention image respectively.
  • the scale of the sixth feature image is the same as that of the fifth self-attention image
  • the scale of the seventh feature image is the same as that of the sixth self-attention image
  • the scale of the eighth feature image is the same as the scale of the seventh self-attention image.
  • The fifth self-attention image and the sixth feature image are dot-multiplied to obtain the ninth feature image, the sixth self-attention image and the seventh feature image are dot-multiplied to obtain the tenth feature image, and the seventh self-attention image and the eighth feature image are dot-multiplied to obtain the eleventh feature image.
  • the ninth feature image, the tenth feature image, and the eleventh feature image are fused to obtain the output image of the scale-aware convolutional layer.
  • Optionally, the fusion processing described above may add the pixel values of the pixels at the same position in the images to be fused (see the sketch below).
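  • Putting FIG. 9 together, here is a minimal sketch of the scale-aware convolutional layer, assuming PyTorch. PyTorch's Conv2d requires a dilation of at least 1, so the dilation-rate-0 branch is implemented as the equivalent 1*1 convolution derived above; sharing one set of 3*3 weights across the branches follows the weight-sharing described earlier, though whether the 0-dilation branch shares those weights is our assumption, as are the channel counts and the softmax placement:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleAwareConv(nn.Module):
    """Sketch of the scale-aware convolutional layer of FIG. 9."""
    def __init__(self, channels: int = 64):
        super().__init__()
        # One shared 3*3 weight set and bias for the hole-convolution branches.
        self.weight = nn.Parameter(torch.empty(channels, channels, 3, 3))
        self.bias = nn.Parameter(torch.zeros(channels))
        nn.init.kaiming_normal_(self.weight)
        # Self-attention module: 3 parallel convolutional layers.
        self.attn = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1) for _ in range(3)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Dilation rate 0 collapses all sampling points onto the center pixel,
        # i.e. an equivalent 1*1 kernel whose weight sums the 3*3 weights.
        w0 = self.weight.sum(dim=(2, 3), keepdim=True)
        feats = [
            F.conv2d(x, w0, self.bias),                                  # d = 0
            F.conv2d(x, self.weight, self.bias, padding=1, dilation=1),  # d = 1
            F.conv2d(x, self.weight, self.bias, padding=2, dilation=2),  # d = 2
        ]
        attns = torch.stack([a(x) for a in self.attn], dim=0)
        weights = torch.softmax(attns, dim=0)  # same-position weights sum to 1
        # Dot-multiply each feature image by its attention map, then add (fuse).
        return sum(w * f for w, f in zip(weights, feats))

out = ScaleAwareConv()(torch.randn(1, 64, 32, 32))
```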
  • In order to obtain the crowd counting network used above, this application also provides a method for training a crowd counting network. The training method may include the following steps: obtaining a sample image; processing the sample image through the crowd counting network to obtain a second crowd density image; obtaining a network loss based on the difference between the sample image and the second crowd density image; and adjusting the parameters of the crowd counting network based on the network loss.
  • the above-mentioned sample image can be any digital image.
  • The sample image may contain human objects; it may include only a human face without the torso and limbs (hereinafter, the torso and limbs are referred to as the human body), only the human body without the human face, or only the lower limbs or upper limbs.
  • This application does not limit the region of the human body specifically included in the sample image.
  • the sample image may contain animals.
  • the sample image may contain plants. This application does not limit the content contained in the sample image.
  • the network loss of the crowd counting network can be determined according to the difference between the sample image and the second crowd density image.
  • the above difference may be the difference between the pixel values of the pixel points at the same position in the sample image and the second crowd density image.
  • In the embodiments of the application, the pixel value of a pixel in the sample image can be used to characterize whether there is a person at that pixel. For example, if the image area covered by person A in the sample image includes pixel a, pixel b, and pixel c, then the pixel values of pixels a, b, and c are all 1; if pixel d in the sample image does not belong to any image area covered by a person, its pixel value is 0.
  • the parameters of the crowd counting network can be adjusted by means of reverse gradient propagation based on the network loss until the crowd counting network converges, and the training of the crowd counting network is completed.
  • However, the pixel value of each pixel in the sample image is either 0 or 1, while the pixel value of each pixel in the second crowd density image is greater than or equal to 0 and less than or equal to 1. Determining the network loss of the crowd counting network based on the difference between the sample image and the second crowd density image therefore introduces a large error.
  • For this reason, the real crowd density image of the sample image can be used as the supervision information, and the network loss of the crowd counting network can be determined based on the difference between the real crowd density image and the second crowd density image, so as to improve the accuracy of the obtained network loss.
  • the real crowd density image of the sample image can be obtained according to an impulse function, a Gaussian kernel, and the sample image.
  • the person label image of the sample image can be obtained according to the impulse function; the pixel value of a pixel in the person label image characterizes whether that pixel belongs to an image area covered by a person.
  • the above-mentioned person label image satisfies the following formula:

    $H(x)=\sum_{i=1}^{N}\delta(x-x_i)$ … formula (3)

  • N is the total number of people in the sample image; $x_i$ is the position in the sample image of the center of the image area covered by the i-th person, and is used to represent that person.
  • $\delta(x-x_i)$ is the impulse function of the position in the sample image of the center of the image area covered by the person: if there is a person at x in the sample image, δ(x) equals 1; if there is no person at x, δ(x) equals 0.
  • the real crowd density image of the sample image can then be obtained by using a Gaussian kernel to perform convolution processing on the above-mentioned person label image; the process satisfies the following formula:

    $D(x)=\sum_{i=1}^{N}\delta(x-x_i)*G_{\sigma_i}(x)$, with $\sigma_i=\beta d_i$ … formula (4)

  • $x_i$ in formula (3) here is the position in the sample image of the center of the image area covered by the head of a person (hereinafter referred to as the center of the head region), and $\delta(x-x_i)$ is the impulse function of the position of the center of the head region in the sample image: if there is a human head at x in the sample image, δ(x) equals 1, and otherwise δ(x) equals 0.
  • the standard deviation of the Gaussian kernel used to convolve the i-th head in the person label image satisfies $\sigma_i=\beta d_i$, where β is a positive number and $d_i$ is the average distance between the center of the i-th head and the centers of the m target heads (a target head being one of the heads closest to the i-th head in the person label image). The size of a head is usually related to the distance between the centers of two adjacent people in a crowded scene, so $d_i$ is approximately equal to the head size when the crowd is dense.
  • because the standard deviation satisfies $\sigma_i=\beta d_i$, it is positively correlated with the scale of the image area covered by the head, which makes the real crowd density image obtained by the Gaussian convolution more accurate.
  • the network loss of the crowd counting network can be determined according to the difference between the pixel values of the pixels at the same position in the real crowd density image and the second crowd density image; for example, the sum of these differences over all positions is used as the network loss (see the density-map sketch below).
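Putting formulas (3) and (4) together, the real (ground-truth) crowd density image can be sketched as follows. The values β = 0.3 and m = 3 are our assumptions (the description only requires β to be positive), as is the fallback standard deviation used when an image contains a single head.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def density_map(head_points, shape, beta=0.3, m=3):
    """Geometry-adaptive ground truth: an impulse (label) image per head,
    convolved with a Gaussian whose std is beta times the mean distance
    to that head's m nearest neighbors (formula (4))."""
    density = np.zeros(shape, dtype=np.float32)
    pts = np.asarray(head_points, dtype=np.float32)
    for i, (r, c) in enumerate(pts):
        label = np.zeros(shape, dtype=np.float32)
        label[int(r), int(c)] = 1.0            # impulse at the head center
        d = np.sqrt(((pts - pts[i]) ** 2).sum(axis=1))
        d = np.sort(d)[1:m + 1]                # distances to m nearest heads
        sigma = beta * d.mean() if len(d) else 4.0  # fallback is our choice
        density += gaussian_filter(label, sigma)    # delta * G_sigma_i
    return density  # pixel values sum to the number of heads
```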
  • before inputting the sample image to the crowd counting network, the sample image may be preprocessed to obtain at least one preprocessed image, and the at least one preprocessed image is input to the crowd counting network as training data.
  • the effect of expanding the training data set of the crowd counting network can be achieved.
  • the above-mentioned preprocessing includes at least one of: cropping an image of a predetermined size from the sample image, and flipping the sample image or the image of the predetermined size.
  • the predetermined size can be 64*64.
  • flipping the sample image includes horizontal mirror flipping.
  • for example, dividing the sample image along its horizontal and vertical center axes yields 4 preprocessed images, and randomly cropping 5 images of the predetermined size from the sample image yields 5 more, for 9 preprocessed images in total. Applying horizontal mirror flipping to those 9 images yields 9 flipped images, i.e., another 9 preprocessed images. In this way, 18 preprocessed images can be obtained (a sketch of this pipeline follows).
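A sketch of this 18-image pipeline, assuming a 2-D (or H×W×C) NumPy image large enough for the 64×64 crops; the function name and the RNG seed are ours.

```python
import numpy as np

def preprocess(sample, crop=64, n_random=5, seed=0):
    """Quarter the sample along its center axes, add random fixed-size
    crops, then append a horizontal mirror of every image."""
    rng = np.random.default_rng(seed)
    h, w = sample.shape[:2]
    images = [sample[:h // 2, :w // 2], sample[:h // 2, w // 2:],
              sample[h // 2:, :w // 2], sample[h // 2:, w // 2:]]
    for _ in range(n_random):
        r = int(rng.integers(0, h - crop + 1))
        c = int(rng.integers(0, w - crop + 1))
        images.append(sample[r:r + crop, c:c + crop])
    images += [img[:, ::-1] for img in images]  # horizontal mirror flips
    return images  # 4 quarters + 5 crops, plus their 9 flips = 18
```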
  • by inputting the at least one preprocessed image into the crowd counting network, at least one third crowd density image can be obtained, where each preprocessed image corresponds to one third crowd density image.
  • for example (example 2), inputting the three preprocessed images image A, image B, and image C into the crowd counting network yields the crowd density image a corresponding to image A, the crowd density image b corresponding to image B, and the crowd density image c corresponding to image C.
  • the crowd density image a, the crowd density image b, and the crowd density image c can all be called the third crowd density image.
  • the network loss of the crowd counting network can be obtained according to the difference between a target image among the at least one preprocessed image and the third crowd density image corresponding to that target image.
  • continuing example 2: a first difference is obtained from the difference between image A and image a, a second difference from the difference between image B and image b, and a third difference from the difference between image C and image c; summing the first difference, the second difference, and the third difference yields the network loss of the crowd counting network (sketched below).
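The per-image differences and their sum might look like the following sketch, again using MSE as a stand-in for the unspecified "difference" and assuming the targets are the real density maps of the preprocessed images.

```python
import torch
import torch.nn.functional as F

def network_loss(model, preprocessed_images, target_densities):
    """One difference per (preprocessed image, target) pair, summed."""
    losses = [F.mse_loss(model(x), t, reduction='sum')
              for x, t in zip(preprocessed_images, target_densities)]
    return torch.stack(losses).sum()
```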
  • this embodiment provides a crowd counting network; using it to process an image to be processed yields the crowd density image corresponding to that image, from which the number of people in the image can be determined.
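Once a density image is produced, the headcount itself is just the pixel sum, e.g.:

```python
def count_people(density_map):
    """The headcount is the sum of the density map's pixel values."""
    return float(density_map.sum())
```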
  • the embodiments of the present application also provide several possible application scenarios:
  • Scenario A: as mentioned above, crowds in public places often become too dense because of excessive foot traffic, and public accidents can then occur, so counting the crowds in public places is of great significance.
  • at present, to enhance safety in work, living, and social environments, surveillance camera equipment is installed in various public places so that security protection can be carried out based on the video stream information.
  • using the technical solutions provided by the embodiments of the present application to process the video streams collected by the surveillance camera equipment, the number of people in a public place can be determined, thereby effectively preventing public accidents.
  • for example, the server of the video stream processing center of the surveillance camera equipment can execute the technical solution provided in the embodiments of the present application, and the server can be connected to at least one surveillance camera. After obtaining the video stream sent by a surveillance camera, the server can use the technical solution to process each frame of the video stream and determine the number of people in each frame. When the number of people in a frame is greater than or equal to the headcount threshold, the server can send instructions to related devices to issue a prompt or an alarm; for example, it may instruct the camera that collected the image to raise an alarm, or instruct the terminal of the management personnel in the camera's area to output a prompt that the number of people exceeds the threshold (a schematic loop follows).
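A schematic of this monitoring loop, with the model, threshold, and alert callback all left abstract (every name here is ours):

```python
def monitor(frames, model, threshold, alert):
    """Estimate each frame's headcount from the predicted density map
    and trigger an alert when it reaches the threshold."""
    for frame in frames:
        count = float(model(frame).sum())
        if count >= threshold:
            alert(count)  # e.g. notify the camera or a manager's terminal
```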
  • Scenario B: the flow of people differs across areas of a shopping mall, and placing the main product in a high-traffic area for display can effectively increase its sales; therefore, accurately determining the flow of people in different areas of the mall is of great significance to merchants. For example, a mall has area A, area B, and area C, and area B has the largest flow of people; based on this, the merchant can place the main product in area B for display to increase its sales.
  • the server of the control center for the mall's surveillance video streams can execute the technical solution provided in the embodiments of the present application, and the server can be connected to at least one surveillance camera. After obtaining the video stream sent by a surveillance camera, the server can process each frame of the video stream to determine the number of people in it. From the per-frame counts, the flow of people in the area monitored by each camera over a given period can be determined, and hence the flow of people in the different areas of the mall. For example, a mall has area A, area B, and area C, and camera A, camera B, and camera C.
  • Camera A monitors area A
  • camera B monitors area B
  • camera C monitors area C.
  • the server uses the technical solution provided by the embodiments of the application to process the images in the video streams collected by the three cameras and determines that, over the past week, the average daily flow of people was 900 in area A, 200 in area B, and 600 in area C.
  • area A has the most traffic, so the merchant can place the main product in area A for display to increase its sales (a tiny aggregation sketch follows).
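How the per-frame counts might be rolled up into these per-area averages, as a sketch; the aggregation rule and all names are ours, since the document only says the flow over a period is determined from per-frame headcounts.

```python
def average_daily_flow(daily_counts_by_area):
    """Average the per-day headcount totals of each monitored area."""
    return {area: sum(counts) / len(counts)
            for area, counts in daily_counts_by_area.items()}

# e.g. average_daily_flow({"A": [...], "B": [...], "C": [...]}) would
# reproduce the 900 / 200 / 600 comparison above.
```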
  • those skilled in the art can understand that, in the above methods of the specific implementations, the writing order of the steps does not imply a strict execution order or constitute any limitation on the implementation process; the specific execution order of each step should be determined by its function and possible internal logic.
  • FIG. 10 is a schematic structural diagram of an image processing device provided by an embodiment of the application.
  • the device 1 includes: an acquisition unit 11, a convolution processing unit 12, a fusion processing unit 13, a feature extraction processing unit 14, a first determination unit 15, a second determination unit 16, and a training unit 17, where:
  • the acquiring unit 11 is configured to acquire an image to be processed, a first convolution kernel, and a second convolution kernel, where the receptive field of the first convolution kernel is different from the receptive field of the second convolution kernel;
  • the convolution processing unit 12 is configured to use the first convolution kernel to perform convolution processing on the to-be-processed image to obtain a first feature image, and use the second convolution kernel to perform convolution processing on the to-be-processed image to obtain a second feature image.
  • the fusion processing unit 13 is configured to perform fusion processing on the first feature image and the second feature image to obtain a first crowd density image.
  • the device 1 further includes:
  • the feature extraction processing unit 14 is configured to perform a first feature extraction process on the to-be-processed image before the fusion process is performed on the first feature image and the second feature image to obtain a first crowd density image, Obtain a first self-attention image, perform a second feature extraction process on the image to be processed, and obtain a second self-attention image. Both the first self-attention image and the second self-attention image are used for characterization The scale information of the image to be processed, and the scale information represented by the first self-attention image is different from the scale information represented by the second self-attention image;
  • the first determining unit 15 is configured to determine the first weight of the first characteristic image according to the first self-attention image, and determine the second weight of the second characteristic image according to the second self-attention image;
  • the fusion processing unit 13 is configured to: perform fusion processing on the first feature image and the second feature image according to the first weight and the second weight to obtain the first crowd density image.
  • the fusion processing unit 13 is specifically configured to: determine the dot product between the first weight and the first feature image to obtain a third feature image; determine the dot product between the second weight and the second feature image to obtain a fourth feature image; and perform fusion processing on the third feature image and the fourth feature image to obtain the first crowd density image.
  • the first determining unit 15 is configured to: normalize the first self-attention image and the second self-attention image to obtain a third self-attention image corresponding to the first self-attention image and a fourth self-attention image corresponding to the second self-attention image; and use the third self-attention image as the first weight and the fourth self-attention image as the second weight (sketched below).
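A compact sketch of this normalization-and-weighting step: a per-pixel softmax across the two self-attention images guarantees the two weights sum to 1 at every position (tensor and function names are ours).

```python
import torch

def normalized_fusion(att1, att2, feat1, feat2):
    """Softmax across the two self-attention images gives the third and
    fourth (normalized) maps; the features are dot-multiplied by these
    weights and added."""
    w = torch.softmax(torch.stack([att1, att2]), dim=0)
    return w[0] * feat1 + w[1] * feat2
```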
  • the feature extraction processing unit 14 is further configured to perform convolution processing on the image to be processed using the first convolution kernel to obtain a first feature image, and use the Before the second convolution kernel performs convolution processing on the to-be-processed image to obtain a second feature image, performing a third feature extraction process on the to-be-processed image to obtain a fifth feature image;
  • the convolution processing unit 12 is configured to: use the first convolution kernel to perform convolution processing on the fifth feature image to obtain the first feature image, and use the second convolution kernel to perform convolution processing on the fifth feature image to obtain the second feature image;
  • the feature extraction processing unit 14 is further configured to: perform the first feature extraction processing on the fifth feature image to obtain the first self-attention image, and perform the second feature extraction processing on the fifth feature image to obtain the second self-attention image.
  • the first convolution kernel and the second convolution kernel are both dilated convolution kernels, the size of the first convolution kernel is the same as the size of the second convolution kernel, the weight of the first convolution kernel is the same as the weight of the second convolution kernel, and the dilation rate of the first convolution kernel is different from the dilation rate of the second convolution kernel.
  • the dilation rate of the first convolution kernel or the second convolution kernel is a reference value (namely 0, which makes the kernel act as a 1×1 convolution with a receptive field of 1; see the sketch below).
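The kernel relationships described above can be sketched as follows: one shared 3×3 weight applied with different dilation rates yields different receptive fields, and the dilation-rate-0 case degenerates to a 1×1 convolution whose weight is the sum of the 3×3 weights (the shapes here are our assumptions).

```python
import torch
import torch.nn.functional as F

w = torch.randn(1, 1, 3, 3)   # one shared weight tensor
x = torch.randn(1, 1, 8, 8)
y_d1 = F.conv2d(x, w, padding=1, dilation=1)         # receptive field 3x3
y_d2 = F.conv2d(x, w, padding=2, dilation=2)         # receptive field 5x5
# Dilation rate 0: all nine sampling points collapse onto the center
# pixel, so the kernel acts as a 1x1 convolution (receptive field 1).
y_d0 = F.conv2d(x, w.sum(dim=(2, 3), keepdim=True))
```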
  • the device 1 further includes: a second determining unit 16 configured to determine the sum of pixel values in the first crowd density image to obtain the number of people in the image to be processed.
  • the image processing method executed by the apparatus 1 is applied to a crowd counting network
  • the device 1 further includes a training unit 17 for training the crowd counting network, and the training process of the crowd counting network includes: obtaining a sample image; processing the sample image through the crowd counting network to obtain a second crowd density image; obtaining the network loss according to the difference between the sample image and the second crowd density image; and adjusting the parameters of the crowd counting network based on the network loss.
  • the training unit 17 is further configured to: before the network loss is obtained according to the difference between the sample image and the second crowd density image, obtain the real crowd density image of the sample image based on an impulse function, a Gaussian kernel, and the sample image; and obtain the network loss according to the difference between the real crowd density image and the second crowd density image.
  • the training unit 17 is further configured to: before the sample image is processed through the crowd counting network to obtain the second crowd density image, preprocess the sample image to obtain at least one preprocessed image; process the at least one preprocessed image with the crowd counting network to obtain at least one third crowd density image, the preprocessed images corresponding one-to-one with the third crowd density images; and obtain the network loss according to the difference between a target image among the at least one preprocessed image and the third crowd density image corresponding to the target image.
  • the preprocessing includes at least one of: cropping an image of a predetermined size from the sample image, and flipping the sample image or the image of the predetermined size.
  • in this embodiment, the first convolution kernel and the second convolution kernel with different receptive fields are used to perform convolution processing on the image to be processed respectively, so as to extract information describing the content of the image at different scales, obtaining the first feature image and the second feature image. By fusing the first feature image and the second feature image, the information describing the image content at different scales is exploited, which improves the accuracy of the obtained crowd density image corresponding to the image to be processed and, in turn, the accuracy of the number of people obtained for that image.
  • the functions or modules contained in the device provided in the embodiments of the present disclosure can be used to execute the methods described in the above method embodiments.
  • FIG. 11 is a schematic diagram of the hardware structure of an image processing device provided by an embodiment of the application.
  • the image processing device 2 includes a processor 21, a memory 22, and may also include an input device 23 and an output device 24.
  • the processor 21, the memory 22, the input device 23, and the output device 24 are coupled through a connector, and the connector includes various types of interfaces, transmission lines or buses, etc., which are not limited in the embodiment of the present application. It should be understood that in the various embodiments of the present application, coupling refers to mutual connection in a specific manner, including direct connection or indirect connection through other devices, such as connection through various interfaces, transmission lines, buses, and the like.
  • the processor 21 may be one or more graphics processing units (GPUs).
  • the GPU may be a single-core GPU or a multi-core GPU.
  • the processor 21 may be a processor group composed of multiple GPUs, and the multiple processors are coupled to each other through one or more buses.
  • the processor may also be other types of processors, etc., which is not limited in the embodiment of the present application.
  • the memory 22 may be used to store computer program instructions and various types of computer program codes including program codes used to execute the solutions of the present application.
  • the memory includes, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), or compact disc read-only memory (CD-ROM), and is used for related instructions and data.
  • the input device 23 is used to input data and signals, and the output device 24 is used to output data and signals.
  • the input device 23 and the output device 24 may be independent devices or a whole device.
  • the memory 22 can be used not only to store related instructions, but also to store related images.
  • for example, the memory 22 can be used to store the image to be processed obtained through the input device 23, or to store the first crowd density image and the like obtained by the processor 21; the embodiment of the present application does not limit the specific data stored in the memory.
  • FIG. 11 only shows a simplified design of the image processing device.
  • in practical applications, the image processing device may also contain other necessary components, including but not limited to any number of input/output devices, processors, and memories, and all image processing devices that can implement the embodiments of this application fall within the protection scope of this application.
  • the embodiment of the present application also provides a processor. A computer program can be stored in the cache of the processor; when the computer program is executed by the processor, the processor can execute the technical solutions provided in embodiment (1) and embodiment (2), or realize the processing of the image to be processed by the trained crowd counting network.
  • the disclosed system, device, and method can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
  • when implemented by software, they may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted through the computer-readable storage medium.
  • the computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (such as infrared, radio, or microwave) means.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server or a data center integrated with one or more available media.
  • the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a digital versatile disc (DVD)), or a semiconductor medium (for example, a solid state disk (SSD)), etc.
  • a person of ordinary skill in the art can understand that all or part of the processes in the foregoing method embodiments can be completed by a computer program instructing related hardware. The program can be stored in a volatile or non-volatile computer-readable storage medium, and when executed, it may include the processes of the foregoing method embodiments.
  • the aforementioned storage media include media that can store program code, such as read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.


Abstract

一种图像处理方法及装置、处理器、电子设备、存储介质。该方法包括:获取待处理图像、第一卷积核和第二卷积核,所述第一卷积核的感受野与所述第二卷积核的感受野不同(101);使用所述第一卷积核对所述待处理图像进行卷积处理获得第一特征图像,使用所述第二卷积核对所述待处理图像进行卷积处理获得第二特征图像(102);对所述第一特征图像和所述第二特征图像进行融合处理,获得第一人群密度图像(103)。应用本方法可获得与待处理图像对应的人群密度图像,进而确定待处理图像中的人数。

Description

图像处理方法及装置、处理器、电子设备、存储介质
本申请要求于2019年11月27日提交中国专利局、申请号为201911182723.7、发明名称为“图像处理方法及装置、处理器、电子设备、存储介质”,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及图像处理技术领域,尤其涉及一种图像处理方法及装置、处理器、电子设备、存储介质。
背景技术
当公共场所出现人流量过大的情况时,易发生诸如踩踏之类的公共事件。因此如何对公共场所进行人群计数具有重大意义。
传统方法基于深度学习技术可对公共场所的图像进行处理,提取出图像中的特征信息,并依据该特征信息可确定与公共场所的图像对应的人群密度图像,进而可依据人群密度图像确定该公共场所的图像种的人数,实现人群计数。
发明内容
本申请提供一种图像处理方法及装置、处理器、电子设备、存储介质。
第一方面,提供了一种图像处理方法,所述方法包括:
获取待处理图像、第一卷积核和第二卷积核,所述第一卷积核的感受野与所述第二卷积核的感受野不同;
使用所述第一卷积核对所述待处理图像进行卷积处理获得第一特征图像,使用所述第二卷积核对所述待处理图像进行卷积处理获得第二特征图像;
对所述第一特征图像和所述第二特征图像进行融合处理,获得第一人群密度图像。
在该方面中,通过使用感受野不同的第一卷积核和第二卷积核分别对待处理图像进行卷积处理,以提取出不同尺度下的描述待处理图像的内容的信息,分别获得第一特征图像和第二特征图像。通过对第一特征图像和第二特征图像进行融合处理,以利用不同尺度下的描述待处理图像的内容的信息,进而提高获得的与待处理图像对应的人群密度图像的精度。
在一种可能实现的方式中,在所述对所述第一特征图像和所述第二特征图像进行融合处理,获得第一人群密度图像之前,所述方法还包括:
对所述待处理图像进行第一特征提取处理,获得第一自注意力图像,对所述待处理图像进行第二特征提取处理,获得第二自注意力图像,所述第一自注意力图像和所述第二自注意力图像均用于表征所述待处理图像的尺度信息,且所述第一自注意力图像所表征的尺度信息与所述第二自注意力图像所表征的尺度信息不同;
依据所述第一自注意力图像确定所述第一特征图像的第一权重,依据所述第二自注意力图像确定所述第二特征图像的第二权重;
所述对所述第一特征图像和所述第二特征图像进行融合处理,获得第一人群密度图像,包括:
依据所述第一权重和所述第二权重对所述第一特征图像和所述第二特征图像进行融合处理,获得所述第一人群密度图像。
在该种可能实现的方式中,通过对待处理图像分别进行第一特征提取处理和第二特征提取处理以提取不同尺度下的待处理图像的信息,获得第一自注意力图像和第二自注意力图像。依据第一自注意力图像确定第一特征图像的第一权重,依据第二自注意力图像确定第二特征图像的第二权重,并依据第一权重和第二权重对第一特征图像和第二特征图像进行融合处理,可提高获得的第一人群密度图像的精度。
在另一种可能实现的方式中,所述依据所述第一权重和所述第二权重对所述第一特征图像和所述第二特征图像进行融合处理,获得所述第一人群密度图像,包括:
确定所述第一权重与所述第一特征图像之间的点积,获得第三特征图像;
确定所述第二权重与所述第二特征图像之间的点积,获得第四特征图像;
对所述第三特征图像和所述第四特征图像进行融合处理,获得所述第一人群密度图像。
在又一种可能实现的方式中,所述依据所述第一自注意力图像确定所述第一特征图像的第一权重,依据所述第二自注意力图像确定所述第二特征图像的第二权重,包括:
对所述第一自注意力图像和所述第二自注意力图像进行归一化处理,获得所述第一自注意力图像对应的第三自注意力图像和所述第二自注意力图像对应的第四自注意力图像;
将所述第三自注意力图像作为所述第一权重,将所述第四自注意力图像作为所述第二 权重。
在该种可能实现的方式中,通过对第一自注意力图像和第二自注意力图像进行归一化处理,可使第一自注意力图像与第二自注意力图像中相同位置的像素点的像素值的和为1。再通过将第一自注意力图像作为第一权重、将第二自注意力图像作为第二权重对第一特征图像和第二特征图像进行融合处理,可实现对待处理图像中不同图像区域执行不同感受野的卷积处理,进而提高获得的第一人群密度图像的精度。
在又一种可能实现的方式中,在所述使用所述第一卷积核对所述待处理图像进行卷积处理获得第一特征图像,使用所述第二卷积核对所述待处理图像进行卷积处理获得第二特征图像之前,所述方法还包括:
对所述待处理图像进行第三特征提取处理,获得第五特征图像;
所述使用所述第一卷积核对所述待处理图像进行卷积处理获得第一特征图像,使用所述第二卷积核对所述待处理图像进行卷积处理获得第二特征图像,包括:
使用所述第一卷积核对所述第五特征图像进行卷积处理获得所述第一特征图像,使用所述第二卷积核对所述第五特征图像进行卷积处理获得所述第二特征图像;
所述对所述待处理图像进行第一特征提取处理,获得第一自注意力图像,对所述待处理图像进行第二特征提取处理,获得第二自注意力图像,包括:
对所述第五特征图像进行所述第一特征提取处理,获得所述第一自注意力图像,对所述第五特征图像进行所述第二特征提取处理,获得所述第二自注意力图像。
在该种可能实现的方式中,在使用第一卷积核对待处理图像进行卷积处理获得第一特征图像,使用第二卷积核对待处理图像进行卷积处理获得第二特征图像之前,对待处理图像进行第三特征提取处理,以提取出待处理图像的特征信息,获得第五特征图像。使用第一卷积核对第五特征图像进行卷积处理获得第一特征图像,使用第二卷积核对所述第五特征图像进行卷积处理获得所述第二特征图像。这样可从待处理图像中提取出更丰富的特征信息。
在又一种可能实现的方式中,所述第一卷积核和所述第二卷积核均为空洞卷积核,且所述第一卷积核的大小与所述第二卷积核的大小相同,且所述第一卷积核的权重与所述第二卷积核的权重相同,且所述第一卷积核的扩张率与所述第二卷积核的扩张率不同。
在该种可能实现的方式中,在第一卷积核和第二卷积核均为空洞卷积核的情况下,可将第一卷积核的权重与第二卷积核的权重取为相同,且可使第一卷积核的感受野与第二卷积核的感受野不同。这样,使用第一卷积核对待处理图像进行卷积处理获得的第一特征图像包含的信息和使用第二卷积核对待处理图像进行卷积核处理获得的第二特征图像包含的信息仅存在尺度上的差异。在对第一特征图像和第二特征图像进行融合处理时,可更好的利用不同尺度下待处理图像的信息提高获得的第一人群密度图像的精度。
在又一种可能实现的方式中,所述第一卷积核或所述第二卷积核的扩张率为参考值。
在该种可能实现的方式中,通过将第一卷积核或第二卷积核的扩张率设为0(即参考值),可在使用第一卷积核或第二卷积核对待处理图像进行卷积处理时实现对待处理图像进行感受野为1的卷积处理,以更好的提取出待处理图像中尺度小的图像区域的信息。
在又一种可能实现的方式中,所述方法还包括:确定所述第一人群密度图像中的像素值的和,获得所述待处理图像中的人数。
在该种可能实现的方式中,依据第一人群密度图像可确定待处理图像中的人数。
在又一种可能实现的方式中,所述方法应用于人群计数网络;
所述人群计数网络的训练过程包括:
获取样本图像;
使用所述人群计数网络对所述样本图像进行处理,获得第二人群密度图像;
依据所述样本图像与所述第二人群密度图像之间的差异,获得网络损失;
基于所述网络损失调整所述人群计数网络的参数。
在该种可能实现的方式中,使用训练后的人群计数网络对待处理图像进行处理,可获得与待处理图像对应的人群密度图像。
在又一种可能实现的方式中,在所述依据所述样本图像与所述第二人群密度图像之间的差异,获得网络损失之前,所述方法还包括:
依据冲击函数、高斯核以及所述样本图像,获得所述样本图像的真实人群密度图像;
所述依据所述样本图像与所述第二人群密度图像之间的差异,获得网络损失,包括:
依据所述真实人群密度图像与所述第二人群密度图像之间的差异,获得所述网络损失。
在该种可能实现的方式中,将样本图像的真实人群密度图像作为人群计数网络的监督数据,依据真实人群密度图像与第二人群密度图像之间的差异,确定人群计数网络的网络损失,可提高获得的网络损失的精度,进而提升对人群计数网络的训练效果。
在又一种可能实现的方式中,在所述经所述人群计数网络对所述样本图像进行处理,获得第二人群密度图像之前,所述方法还包括:
对所述样本图像进行预处理,获得至少一张预处理后的图像;
所述经所述人群计数网络对所述样本图像进行处理,获得第二人群密度图像,包括:
使用所述人群计数网络对所述至少一张预处理后的图像进行处理,获得至少一张第三人群密度图像,所述预处理后的图像与所述第三人群密度图像一一对应;
所述依据所述样本图像与所述第二人群密度图像之间的差异,获得网络损失,包括:
依据所述至少一张预处理后的图像中的目标图像和与所述目标图像对应的第三人群密度图像之间的差异,获得所述网络损失。
在该种可能实现的方式中,在将样本图像输入至人群计数网络之前,通过对样本图像进行预处理,获得至少一张预处理后的图像,并将上述至少一张预处理后的图像作为训练数据输入至人群计数网络。这样,可达到扩充人群计数网络的训练数据集的效果。
在又一种可能实现的方式中,所述预处理包括:从所述样本图像中截取预定尺寸的图像、对所述样本图像或所述预定尺寸的图像进行翻转处理中的至少一种。
第二方面,提供了一种图像处理装置,所述装置包括:
获取单元,用于获取待处理图像、第一卷积核和第二卷积核,所述第一卷积核的感受野与所述第二卷积核的感受野不同;
卷积处理单元,用于使用所述第一卷积核对所述待处理图像进行卷积处理获得第一特征图像,使用所述第二卷积核对所述待处理图像进行卷积处理获得第二特征图像;
融合处理单元,用于对所述第一特征图像和所述第二特征图像进行融合处理,获得第一人群密度图像。
在一种可能实现的方式中,所述装置还包括:
特征提取处理单元,用于在所述对所述第一特征图像和所述第二特征图像进行融合处理,获得第一人群密度图像之前,对所述待处理图像进行第一特征提取处理,获得第一自注意力图像,对所述待处理图像进行第二特征提取处理,获得第二自注意力图像,所述第一自注意力图像和所述第二自注意力图像均用于表征所述待处理图像的尺度信息,且所述第一自注意力图像所表征的尺度信息与所述第二自注意力图像所表征的尺度信息不同;
第一确定单元,用于依据所述第一自注意力图像确定所述第一特征图像的第一权重,依据所述第二自注意力图像确定所述第二特征图像的第二权重;
所述融合处理单元用于:
依据所述第一权重和所述第二权重对所述第一特征图像和所述第二特征图像进行融合处理,获得所述第一人群密度图像。
在另一种可能实现的方式中,所述融合处理单元具体用于:
确定所述第一权重与所述第一特征图像之间的点积,获得第三特征图像;
确定所述第二权重与所述第二特征图像之间的点积,获得第四特征图像;
对所述第三特征图像和所述第四特征图像进行融合处理,获得所述第一人群密度图像。
在又一种可能实现的方式中,所述第一确定单元用于:
对所述第一自注意力图像和所述第二自注意力图像进行归一化处理,获得所述第一自注意力图像对应的第三自注意力图像和所述第二自注意力图像对应的第四自注意力图像;
将所述第三自注意力图像作为所述第一权重,将所述第四自注意力图像作为所述第二权重。
在又一种可能实现的方式中,所述特征提取处理单元,还用于在所述使用所述第一卷积核对所述待处理图像进行卷积处理获得第一特征图像,使用所述第二卷积核对所述待处理图像进行卷积处理获得第二特征图像之前,对所述待处理图像进行第三特征提取处理,获得第五特征图像;
所述卷积处理单元用于:
使用所述第一卷积核对所述第五特征图像进行卷积处理获得所述第一特征图像,使用所述第二卷积核对所述第五特征图像进行卷积处理获得所述第二特征图像;
所述特征提取处理单元还用于:
对所述第五特征图像进行所述第一特征提取处理,获得所述第一自注意力图像,对所述第五特征图像进行所述第二特征提取处理,获得所述第二自注意力图像。
在又一种可能实现的方式中,所述第一卷积核和所述第二卷积核均为空洞卷积核,且所述第一卷积核的大小与所述第二卷积核的大小相同,且所述第一卷积核的权重与所述第二卷积核的权重相同,且所述第一卷积核的扩张率与所述第二卷积核的扩张率不同。
在又一种可能实现的方式中,所述第一卷积核或所述第二卷积核的扩张率为参考值。
在又一种可能实现的方式中,所述装置还包括:第二确定单元,用于确定所述第一人群密度图像中的像素值的和,获得所述待处理图像中的人数。
在又一种可能实现的方式中,所述装置执行的图像处理方法应用于人群计数网络;
所述装置还包括:训练单元,用于对所述人群计数网络进行训练,所述人群计数网络的训练过程包括:
获取样本图像;
使用所述人群计数网络对所述样本图像进行处理,获得第二人群密度图像;
依据所述样本图像与所述第二人群密度图像之间的差异,获得网络损失;
基于所述网络损失调整所述人群计数网络的参数。
在又一种可能实现的方式中,所述训练单元还用于:
在所述依据所述样本图像与所述第二人群密度图像之间的差异,获得网络损失之前,依据冲击函数、高斯核以及所述样本图像,获得所述样本图像的真实人群密度图像;
依据所述真实人群密度图像与所述第二人群密度图像之间的差异,获得所述网络损失。
在又一种可能实现的方式中,所述训练单元还用于:
在所述经所述人群计数网络对所述样本图像进行处理,获得第二人群密度图像之前,对所述样本图像进行预处理,获得至少一张预处理后的图像;
使用所述人群计数网络对所述至少一张预处理后的图像进行处理,获得至少一张第三人群密度图像,所述预处理后的图像与所述第三人群密度图像一一对应;
依据所述至少一张预处理后的图像中的目标图像和与所述目标图像对应的第三人群密度图像之间的差异,获得所述网络损失。
在又一种可能实现的方式中,所述预处理包括:从所述样本图像中截取预定尺寸的图像、对所述样本图像或所述预定尺寸的图像进行翻转处理中的至少一种。
第三方面,提供了一种处理器,所述处理器用于执行如上述第一方面及其任意一种可能实现的方式的方法。
第四方面,提供了一种电子设备,包括:相互连接的处理器和存储器,所述存储器用于存储计算机程序代码,所述计算机程序代码包括计算机指令,当所述处理器执行所述计算机指令时,所述电子设备执行如上述第一方面及其任意一种可能实现的方式的方法。
第五方面,提供了一种计算机可读存储介质,所述计算机可读存储介质中存储有计算机程序,所述计算机程序包括程序指令,所述程序指令当被电子设备的处理器执行时,使所述处理器执行如上述第一方面及其任意一种可能实现的方式的方法。
第六方面,提供了一种包含指令的计算机程序产品,当所述计算机程序产品在计算机上运行时,使得计算机执行上述第一方面及其任一种可能的实现方式的方法。
应当理解的是,以上的一般描述和后文的细节描述仅是示例性和解释性的,而非限制本公开。
附图说明
为了更清楚地说明本申请实施例或背景技术中的技术方案,下面将对本申请实施例或背景技术中所需要使用的附图进行说明。
此处的附图被并入说明书中并构成本说明书的一部分,这些附图示出了符合本公开的实施例,并与说明书一起用于说明本公开的技术方案。
图1为本申请实施例提供的一种图像处理方法的流程示意图;
图2a为本申请实施例提供的一种卷积核的示意图;
图2b为本申请实施例提供的一种卷积核的权重的示意图;
图3为本申请实施例提供的一种相同位置的元素的示意图;
图4为本申请实施例提供的一种人群图像示意图;
图5为本申请实施例提供的另一种图像处理方法的流程示意图;
图6a为本申请实施例提供的一种空洞卷积核的示意图;
图6b为本申请实施例提供的另一种空洞卷积核的示意图;
图7为本申请实施例提供的又一种空洞卷积核的示意图;
图8为本申请实施例提供的一种人群计数网络的结构示意图;
图9为本申请实施例提供的一种尺度感知型卷积层的结构示意图;
图10为本申请实施例提供的一种图像处理装置的结构示意图;
图11为本申请实施例提供的一种图像处理装置的硬件结构示意图。
具体实施方式
为了使本技术领域的人员更好地理解本申请方案,下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别不同对象,而不是用于描述特定顺序。此外,术语“包括”和“具有”以及它们任何变形,意图在于覆盖不排他的包含。例如包含了一系列步骤或单元的过程、方法、系统、产品或 设备没有限定于已列出的步骤或单元,而是可选地还包括没有列出的步骤或单元,或可选地还包括对于这些过程、方法、产品或设备固有的其他步骤或单元。
在本文中提及“实施例”意味着,结合实施例描述的特定特征、结构或特性可以包含在本申请的至少一个实施例中。在说明书中的各个位置出现该短语并不一定均是指相同的实施例,也不是与其它实施例互斥的独立的或备选的实施例。本领域技术人员显式地和隐式地理解的是,本文所描述的实施例可以与其它实施例相结合。
在公共场所(例如广场、超市、地铁站、码头等地方)中,有时会存在人流量过多的情况,进而导致人群过于密集的情况发生。这时易发生一些公共事故,例如踩踏事件。因此,如何对公共场所进行人群计数就变得非常有意义。
随着深度学习技术的发展,基于深度学习的方法可确定图像中的人数,实现人群计数。传统的深度学习方法通过使用一个卷积核对整张图像进行卷积处理以提取出图像中的特征信息,并依据特征信息确定图像中的人数。由于一个卷积核的感受野是固定不变的,若使用一个卷积核对整张图像进行卷积处理,即相当于对图像中不同尺度的内容进行相同感受野的卷积处理,而不同人物在图像中的尺度不同,这将导致不能有效提取出图像中的尺度信息,进而导致确定的人数的误差。
本申请中,图像中近处的人物对应的图像尺度大,图像中远处的人物对应的图像尺度小。本申请实施例中的“远”指与图像中人物对应的真实人物与采集上述图像的成像设备之间的距离远,“近”指与图像中人物对应的真实人物与采集上述图像的成像设备之间的距离近。
在卷积神经网络中,感受野(receptive field)的定义是卷积神经网络每一层输出的特征图(feature map)上的像素点在输入图片上映射的区域大小。本申请中,卷积核的感受野即为使用该卷积核对图像进行卷积处理的感受野。
本申请实施例提供的技术方案可提取出图像中的尺度信息,进而提升确定的人数的精度。
下面结合本申请实施例中的附图对本申请实施例进行描述。
请参阅图1,图1是本申请实施例(一)提供的一种图像处理方法的流程示意图。
101、获取待处理图像、第一卷积核和第二卷积核,上述第一卷积核的感受野与上述第二卷积核的感受野不同。
本申请实施例的执行主体可以是服务器、手机、电脑、平板电脑等终端硬件。本申请实施例提供的方法也可通过处理器运行计算机可执行代码的方式执行。上述待处理图像可以是任意图像。例如,待处理图像可以包含人物对象,其中,待处理图像可以只包括人脸,并无躯干、四肢(下文将躯干和四肢称为人体),也可以只包括人体,不包括人脸,还可以只包括下肢或上肢。本申请对待处理图像具体包含的人体区域不做限定。又例如,待处理图像可以包含动物。再例如,待处理图像可以包含植物。本申请对待处理图像中包含的内容不做限定。
在进行接下来的阐述之前,首先对本申请实施例中的卷积核的权重的含义进行定义。本申请实施例中,通道为1的卷积核以n*n的矩阵的形式存在,该矩阵中包含n*n个元素,每个元素均有一个取值,该矩阵中元素的取值即为卷积核的权重。在图2a所示的3*3的卷积核中,若元素a的取值为44、元素b的取值为118、元素c的取值为192、元素d的取值为32、元素e的取值为83、元素f的取值为204、元素g的取值为61、元素h的取值为174、元素i的取值为250,则该3*3的卷积核的权重为图2b所示的3*3的矩阵。
本申请实施例中,在满足第一卷积核的感受野与第二卷积核的感受野不同的情况下,第一卷积核和第二卷积核均可是任意大小的卷积核,且第一卷积核的权重和第二卷积核的权重均可为任意自然数,本实施例对第一卷积核的大小、第二卷积核的大小、第一卷积核的权重以及第二卷积核的权重均不做限定。
获取待处理图像的方式可以是接收用户通过输入组件输入的待处理图像,也可以是接收终端发送的待处理图像。获取第一卷积核的方式可以是接收用户通过输入组件输入的第一卷积核,也可以是接收终端发送的第一卷积核。获取第二卷积核的方式可以是接收用户通过输入组件输入的第二卷积核,也可以是接收终端发送的第二卷积核。上述输入组件包括:键盘、鼠标、触控屏、触控板和音频输入器等。上述终端包括手机、计算机、平板电脑、服务器等。
102、使用上述第一卷积核对上述待处理图像进行卷积处理获得第一特征图像,使用上述第二卷积核对上述待处理图像进行卷积处理获得第二特征图像。
由于第一卷积核的感受野与第二卷积核的感受野不同,使用第一卷积核对待处理图像进行卷积处理和使用第二卷积核对待处理图像进行卷积处理相当于以不同的感受野“观察” 图像,实现获得不同尺度下的图像信息。即第一特征图像和第二特征图像均包含用于描述待处理图像的内容的信息,但第一特征图像包含的信息的尺度与第二特征图像包含的信息的尺度不同。
103、对上述第一特征图像和上述第二特征图像进行融合处理,获得第一人群密度图像。
本申请实施例中,人群密度图像包含人群密度信息。人群密度图像中的每个像素点的像素值表征在该像素点处的人数。举例来说,人群密度图像中的像素点A的像素值为0.05,则像素点A处有0.05个人。
需要理解的是,由于一个人覆盖的图像区域包含至少一个像素点,当一个人覆盖的图像区域为1个像素点时,该像素点对应的像素值为1,当一个人覆盖的图像区域为至少两个像素点时,该至少两个像素点的像素值的和为1。因此,人群密度图像中的像素值的取值范围为:大于或等于0且小于或等于1。举例来说,人物A覆盖的图像区域包含像素点a、像素点b和像素点c,则像素点a的像素值+像素点b的像素值+像素点c的像素值=1。
上述第一人群密度图像为与待处理图像对应的人群密度图像,可表征待处理图像中的人群密度分布。第一人群密度图像的尺寸与待处理图像的尺寸相同。本实施例中图像的尺寸指图像的宽和高。第一人群密度图像中的第一像素点的像素值可用于表征待处理图像中的第二像素点处的人数。其中,第一像素点在第一人群密度图像中的位置与第二像素点在待处理图像中的位置相同。
本申请实施例中，两张图像中相同位置的像素点可参见图3，如图3所示，像素点A 11在图像A中的位置与像素点B 11在图像B中的位置相同，像素点A 12在图像A中的位置与像素点B 12在图像B中的位置相同，像素点A 13在图像A中的位置与像素点B 13在图像B中的位置相同，像素点A 21在图像A中的位置与像素点B 21在图像B中的位置相同，像素点A 22在图像A中的位置与像素点B 22在图像B中的位置相同，像素点A 23在图像A中的位置与像素点B 23在图像B中的位置相同，像素点A 31在图像A中的位置与像素点B 31在图像B中的位置相同，像素点A 32在图像A中的位置与像素点B 32在图像B中的位置相同，像素点A 33在图像A中的位置与像素点B 33在图像B中的位置相同。
若像素点x在图像X中的位置与像素点y在图像Y中的位置相同,为简洁表述,下文将像素点x称为图像X中与像素点y位置相同的像素点,或将像素点y称为图像Y中与像素点x位置相同的像素点。
由于第一特征图像包含描述待处理图像的图像内容的信息的尺度和第二待处理图像包含描述待处理图像的图像内容的信息的尺度不同,通过对第一特征图像和第二特征图像进行融合处理(例如对应位置的像素值加权处理等),可利用不同尺度下的描述待处理图像的图像内容的信息生成待处理图像对应的人群密度图像,即第一人群密度图像。这样,可提高获得的与待处理图像对应的人群密度图像的精度,进而提升获得的待处理图像中人数的精度。
需要理解的是,本实施例阐述了通过两个感受野不同的卷积核(即第一卷积核和第二卷积核)分别对待处理图像进行卷积处理,获得两个尺度下的描述待处理图像的图像内容的信息。但在实际使用中,也可通过三个或三个以上感受野不同的卷积核分别对待处理图像进行卷积处理,以获得三个或三个以上尺度下的描述待处理图像的图像内容的信息,并将该三个或三个以上尺度下的描述待处理图像的图像内容的信息进行融合,获得与待处理图像对应的人群密度图像。
可选的,在获得第一人群密度图像后,可通过确定第一人群密度图像中所有像素点的像素值的和,得到待处理图像中的人数。
本实施例通过使用感受野不同的第一卷积核和第二卷积核分别对待处理图像进行卷积处理,以提取出不同尺度下的描述待处理图像的内容的信息,分别获得第一特征图像和第二特征图像。通过对第一特征图像和第二特征图像进行融合处理,以利用不同尺度下的描述待处理图像的内容的信息,提高获得的与待处理图像对应的人群密度图像的精度,进而提升获得的待处理图像中人数的精度。
在图像中,近处的人物覆盖的图像区域的面积比远处的人物覆盖的图像区域的面积大。例如,图4中人物A相较于人物B为近处的人物,且人物A覆盖的图像区域的面积比人物B覆盖的图像区域的面积大。而近处的人物覆盖的图像区域的尺度大,远处的人物覆盖的图像区域的尺度小。因此,人物覆盖的图像区域的面积与人物覆盖的图像区域的尺度呈正相关。显然,当卷积处理的感受野与人物覆盖的图像区域的面积相同时,通过卷积处理获得的人物覆盖的图像区域的信息最丰富(下文将可获得人物覆盖的图像区域的最丰富的信息的感受野称为人物覆盖区域的最佳感受野)。也就是说,人物覆盖的图像区域的尺度与人物覆盖区域的最佳感受野呈正相关。
虽然实施例(一)通过使用感受野不同的第一卷积核和第二卷积核分别对待处理图像进行卷积处理获得不同尺度下的描述待处理图像的内容的信息。但第一卷积核的感受野和第二卷积核的感受野均为固定的,而待处理图像中不同的图像区域的尺度不同,因此分别 使用第一卷积核和第二卷积核对待处理图像进行卷积处理无法获得待处理图像中每个图像区域的最佳感受野,即无法使获得的待处理图像中不同图像区域的信息均为最丰富。为此,本申请实施例还提供了一种通过在对第一特征图像和第二特征图像进行融合处理时为第一特征图像和第二特征图像赋予权重,以实现对待处理图像中不同尺度的图像区域进行不同感受野的卷积处理,进而获得更丰富的信息。
请参阅图5,图5是本申请实施例(二)提供的另一种图像处理方法的流程示意图。
501、对上述待处理图像进行第一特征提取处理,获得第一自注意力图像,对上述待处理图像进行第二特征提取处理,获得第二自注意力图像,上述第一自注意力图像和上述第二自注意力图像均用于表征上述待处理图像的尺度信息,且上述第一自注意力图像所表征的尺度信息与上述第二自注意力图像所表征的尺度信息不同。
本申请实施例中,特征提取处理可以是卷积处理,也可以是池化处理,还可以是卷积处理和池化处理的结合。本申请对第一特征提取处理的实现方式和第二特征提取处理的实现方式不做限定。
在一种可能实现的方式中,依次通过多层卷积层对待处理图像进行逐级卷积处理,实现对待处理图像的第一特征提取处理,获得第一自注意力图像。同理,可依次通过多层卷积层对待处理图像进行逐级卷积处理,实现对待处理图像的第二特征提取处理,获得第二自注意力图像。
可选的,在使用第一卷积核对待处理图像进行卷积处理获得第一特征图像,使用第二卷积核对待处理图像进行卷积处理获得第二特征图像之前,可对待处理图像进行第三特征提取处理,以提取出待处理图像的特征信息,获得第五特征图像。使用第一卷积核对第五特征图像进行卷积处理获得第一特征图像,使用第二卷积核对所述第五特征图像进行卷积处理获得所述第二特征图像。这样可从待处理图像中提取出更丰富的特征信息。
上述第一自注意力图像的尺寸和上述第二自注意力图像的尺寸均与待处理图像的尺寸相同。上述第一自注意力图像和上述第二自注意力图像均可用于表征待处理图像的尺度信息(即待处理图像中不同图像区域的尺度),且第一自注意力图像所表征的尺度信息与第二自注意力图像所表征的尺度信息不同。本申请实施例中,图像(包括:上述第一特征图像、上述第二特征图像、上述第一自注意力图像、上述第二自注意力图像、下文将要提及的第三自注意力图像等)的尺度与对待处理图像进行特征提取处理(包括上述第一特征提取处理、上述第二特征提取处理以及上述第三特征提取处理)时所使用的卷积核的感受野匹配。例如,使用大小为3*3的卷积核对图像进行卷积处理得到的图像的尺度为a,使用大小为5*5的卷积核对图像进行卷积处理得到的图像的尺度为b,那么使用大小为3*3的卷积核对待处理图像进行特征提取处理得到的自注意力图像的尺度为a(即该自注意力图像可表征待处理图像在尺度a的信息),使用大小为5*5的卷积核对待处理图像进行特征提取处理得到的特征图像的尺度为b。
举例来说(例1),第一自注意力图像表征待处理图像在尺度a下的信息,第二自注意力图像表征待处理图像在尺度b下的信息,其中,尺度a大于尺度b。
第一自注意力图像中的像素点的像素值和第二自注意力图像中的像素点的像素值的取值范围均为:大于或等于0,且小于或等于1。第一自注意力图像(或第二自注意力图像)中的某个像素点的像素值越接近于1,表征在待处理图像中与该像素点位置相同的像素点的最佳尺度与第一自注意力图像(或第二自注意力图像)所表征的尺度越接近。本申请实施例中,最佳尺度即为与该像素点的最佳感受野对应的尺度。
接着例1继续举例,像素点a和像素点b为第一自注意力图像中的两个不同的像素点,像素点c为待处理图像中与像素点a在第一自注意力图像中的位置相同的像素点,像素点d为待处理图像中与像素点b在第一自注意力图像中的位置相同的像素点。若像素点a的像素值为0.9,像素点b的像素值为0.7。则像素点c的最佳尺度与尺度a之间的差异小于像素点d的最佳尺度与尺度a之间的差异。
502、依据上述第一自注意力图像确定上述第一特征图像的第一权重,依据上述第二自注意力图像确定上述第二特征图像的第二权重。
可选的,上述第一自注意力图像所表征的尺度与第一特征图像的尺度相同,上述第二自注意力图像所表征的尺度与第二特征图像的尺度相同。则第一自注意力图像中的像素点的像素值与1越接近表征第一特征图像中与该像素点在第一自注意力图像中的位置相同的像素点的最佳尺度与第一特征图像的尺度越接近,第二自注意力图像中的像素点的像素值与1越接近表征第二特征图像中与该像素点在第二自注意力图像中的位置相同的像素点的最佳尺度与第二特征图像的尺度越接近。
因此,可依据第一自注意力图像确定第一特征图像的第一权重,以调整第一特征图像中的像素点的尺度,使第一特征图像中的像素点更接近最佳尺度。同理,可依据第二自注意力图像确定第二特征图像的第二权重,以调整第二特征图像中的像素点的尺度,使第二特征图像中的像素点更接近最佳尺度。
在一种可能实现的方式中,可对第一自注意力图像和第二自注意力图像进行归一化处理,获得第一自注意力图像对应的第三自注意力图像和第二自注意力图像对应的第四自注意力图像。将第三自注意力图像作为上述第一权重,将第四自注意力图像作为上述第二权重。
在上述可能实现的方式中,通过对第一自注意力图像和第二自注意力图像进行归一化处理,可使第一自注意力图像与第二自注意力图像中相同位置的像素点的像素值的和为1。举例来说,像素点a在第一自注意力图像中的位置与像素点b在第二自注意力图像中的位置相同,则对第一自注意力图像和第二自注意力图像进行归一化处理后像素点a的像素值和像素点b的像素值的和为1。如像素点c在第三自注意力图像中的位置与像素点a在第一自注意力图像中的位置相同,像素点d在第四自注意力图像中的位置与像素点b在第二自注意力图像中的位置相同,则像素点c的像素值与像素点d的像素值的和为1。
可选的,上述归一化处理可通过将第一自注意力图像和第二自注意力图像分别输入至softmax函数实现。需要理解的是,若第一自注意力图像和第二自注意力图像均包含多个通道的图像,则将第一自注意力图像与第二自注意力图像中相同通道的图像分别输入至softmax函数。例如,第一自注意力图像和第二自注意力图像均包含2个通道的图像,则在对第一自注意力图像和第二自注意力图像进行归一化处理时,可将第一自注意力图像中第一个通道的图像和第二自注意力图像中第一个通道的图像输入至softmax函数,获得第三自注意力图像中第一个通道的图像以及第四自注意力图像中第一个通道的图像。
503、依据上述第一权重和上述第二权重对上述第一特征图像和上述第二特征图像进行融合处理,获得上述第一人群密度图像。
由于获得第一特征图像的卷积处理的感受野和获得第二特征图像的卷积处理的感受野不同。通过将第三自注意力图像作为第一特征图像的第一权重,将第四自注意力图像作为第二特征图像的第二权重对第一特征图像和第二特征图像进行融合处理,可对待处理图像中的不同图像区域进行最佳感受野下的卷积处理。这样,可充分提取待处理图像中不同图像区域的信息,使获得的与待处理图像对应的人群密度图像的精度更高。
在一种依据第一权重和第二权重对第一特征图像和第二特征图像进行融合处理,获得第一人群密度图像的实现方式中,计算第一权重与第一特征图像之间的点积,获得第三特征图像,计算第二权重与第二特征图像之间的点积,获得第四特征图像。通过对第三特征图像和第四特征图像进行融合处理(例如相同位置的像素值相加),可获得第一人群密度图像。
本实施例通过对待处理图像分别进行第一特征提取处理和第二特征提取处理以提取不同尺度下的待处理图像的信息,获得第一自注意力图像和第二自注意力图像。依据第一自注意力图像确定第一特征图像的第一权重,依据第二自注意力图像确定第二特征图像的第二权重,并依据第一权重和第二权重对第一特征图像和第二特征图像进行融合处理,可提高获得的第一人群密度图像的精度。
在实施例(一)和实施例(二)中的第一卷积核的权重和第二卷积核的权重不同时,使用第一卷积核对待处理图像进行卷积处理提取出的特征信息的侧重点与使用第二卷积核对待处理图像进行卷积处理提取出的特征信息的侧重点不同。例如,使用第一卷积核对待处理图像进行卷积处理侧重于提取出待处理图像中人物的属性特征(如衣服颜色、裤子长度),而使用第二卷积核对待处理图像进行卷积处理侧重于提取出待处理图像中人物的轮廓特征(该轮廓特征可用于识别待处理图像中是否包含人物)。再考虑到第一卷积核的感受野和第二卷积核的感受野的不同。这样,在后续对提取出的第一特征图像和第二特征图像进行融合处理时,需要将不同尺度下的不同特征信息进行融合(如将尺度a下的属性特征与尺度b下的轮廓特征融合),这将给尺度信息的融合带来困难。
为此,本申请实施例还提供了一种技术方案,将第一卷积核的权重和第二卷积核的权重取为相同,以减小对第一特征图像和第二特征图像进行融合处理时非尺度信息的融合,提高尺度信息融合的效果,进而提高获得的第一人群密度图像的精度。
由于若第一卷积核和第二卷积核为常规卷积核,在第一卷积核的感受野与第二卷积核的感受野不同的情况下,第一卷积核的权重与第二卷积核的权重不可能相同。因此,在接下来阐述的技术方案中第一卷积核和第二卷积核均为空洞卷积核,且第一卷积核的大小与第二卷积核的大小相同,且第一卷积核的权重与第二卷积核的权重相同,且第一卷积核的扩张率与第二卷积核的扩张率不同。
举例来说,如图6a、图6b所示的两个空洞卷积核,上述两个空洞卷积核的大小均为3*3,其中,图6a所示的空洞卷积核和图6b所示的空洞卷积核中的黑色区域表示有参数,白色部分表示没有参数(即参数为0)。可选的,可将图6a所示的空洞卷积核的权重与图6b所示的空洞卷积核的权重取为相同。此外,从图中可以看出,由于图6a所示的空洞卷积核的扩张率为2,图6b所示的空洞卷积核的扩张率为1,图6a所示的空洞卷积核的感受野与图6b所示的空洞卷积核的感受野不同,具体的,图6a所示的空洞卷积核的感受野(5*5) 比图6b所示的空洞卷积核的感受野(3*3)大。
在第一卷积核和第二卷积核均为空洞卷积核的情况下,可将第一卷积核的权重与第二卷积核的权重取为相同,且可使第一卷积核的感受野与第二卷积核的感受野不同。这样,使用第一卷积核对待处理图像进行卷积处理获得的第一特征图像包含的信息和使用第二卷积核对待处理图像进行卷积核处理获得的第二特征图像包含的信息仅存在尺度上的差异。在对第一特征图像和第二特征图像进行融合处理时,可更好的利用不同尺度下待处理图像的信息提高获得的第一人群密度图像的精度。
可选的,可通过使第一卷积核和第二卷积核共享同一组权重的方式使第一卷积核的权重与第二卷积核的权重相同,这样,在后续分别使用第一卷积核和第二卷积核对待处理图像进行卷积处理时,可减少所需处理的参数的数量。
在空洞卷积核的大小一定的情况下,空洞卷积核的感受野与空洞卷积核的扩张率呈正相关。当空洞卷积核的扩张率为1时,空洞卷积核的感受野与相同大小的常规卷积核的感受野相同,如:图6b所示的空洞卷积核的扩张率为1,此时该空洞卷积核的感受野与大小为3*3的常规卷积核的感受野相同。
考虑到待处理图像中存在最佳尺度较小的像素区域,这些尺度较小的图像区域需要使用较小的感受野的卷积处理才能提取出更丰富的信息。为此本申请实施例还提供了一种将空洞卷积核的扩张率设为0(即参考值),使空洞卷积核的感受野小于常规卷积核的感受野,以更好的提取出待处理图像中尺度较小的图像区域的信息。
下面将从理论上推导扩张率为0的空洞卷积核如何实现。
假设使用一个大小为3*3,扩张率为d的空洞卷积核对待处理图像进行卷积处理,则该卷积处理的过程满足下式:
$O(x,y)=\sum_{i=-1}^{1}\sum_{j=-1}^{1} w_{(1+i,1+j)}\, I(x+i\cdot d,\ y+j\cdot d)+b$ …公式(1)
其中，x和y分别为空洞卷积核滑动至待处理图像上某个像素点时空洞卷积核的中心像素点的位置。(x+i·d, y+j·d)为待处理图像中的采样点在待处理图像中的坐标，w (1+i,1+j)为空洞卷积核的权重，b为空洞卷积核的偏差。I为待处理图像，O为使用空洞卷积核对待处理图像进行卷积处理获得的特征图像。
当d=0时,式(1)可转化为下式:
$O(x,y)=\sum_{i=-1}^{1}\sum_{j=-1}^{1} w_{(1+i,1+j)}\, I(x,y)+b=\sum_{k=1}^{9}\bigl(w'_k\, I(x,y)+b'_k\bigr)$ …公式(2)
其中,w′ k表示大小为1*1的常规卷积核的权重,b′ k表示大小为1*1的常规卷积核的偏差。从式(2)可以看出使用一个大小为3*3、扩张率为0的空洞卷积核对待处理图像进行卷积处理等价于使用9个大小为1*1的常规卷积核分别对待处理图像进行卷积处理。因此,扩张率为0的空洞卷积核可使用9个1*1的常规卷积核代替,即扩张率为0的空洞卷积核中所有权重均位于空洞卷积核上的同一个位置。图7所示为大小为3*3、扩张率为0的空洞卷积核,图6所示的空洞卷积核中的黑色区域即为权重所在的位置。从图6所示的空洞卷积核可以看出,扩张率为0的空洞卷积核的感受野为1。
本申请实施例中,在第一卷积核为空洞卷积核的情况下,通过将第一卷积核的扩张率设为0,可在使用第一卷积核对待处理图像进行卷积处理时实现对待处理图像进行感受野为1的卷积处理,以更好的提取出待处理图像中尺度小的图像区域的信息。
本申请实施例还提供了一种人群计数网络,可用于实现前文所提及的技术方案。请参阅图8,图8为本申请实施例提供的一种人群计数网络的结构示意图。如图8所示,人群计数网络中的网络层依次串联,共包含11层卷积层和9层池化层和6层尺度感知型卷积层。
将待处理图像输入至人群计数网络,经第一层卷积层对待处理图像进行处理获得第一 层卷积层输出的图像,第一层卷积层输出的图像经第二层卷积层的处理获得第二层卷积层输出的图像,第二层卷积层输出的图像经第一层池化层的处理获得第一层池化层输出的图像,…,第十层卷积层输出的图像经第一层尺度感知型卷积层的处理获得第一层尺度感知型卷积层输出的图像,…,第九层池化层输出的图像经第十一层卷积层的处理获得第一人群密度图像。
可选的,人群计数网络中除上述第十一层卷积层之外的所有卷积层中的卷积核的大小均可为3*3,第十一层卷积层中的卷积核的大小为1*1。第一层卷积层中卷积核的数量和第二层卷积层中卷积核的数量均可为64,第三层卷积层中卷积核的数量和第四层卷积层中卷积核的数量均可为128,第五层卷积层中卷积核的数量、第六层卷积层中卷积核的数量以及第七层卷积层中卷积核的数量均可为256,第八层卷积层中卷积核的数量、第九层卷积层中卷积核的数量以及第十层卷积层中卷积核的数量均可为512,第十一层卷积层中卷积核的数量为1。
人群计数网络中的池化层可以为最大池化层,也可以是平均池化层,本申请对此不做限定。
尺度感知型卷积层的结构示意图可参见图9。如图9所示,尺度感知型卷积层包括三个空洞卷积核、一个自注意力模块。上述三个空洞卷积核的结构可参见图6a、图6b和图7,此处将不再赘述。上述自注意力模块包含3个并联的卷积层。
尺度感知型卷积层的输入图像分别经3个不同感受野的空洞卷积核的处理,分别获得第六特征图像、第七特征图像和第八特征图像。
尺度感知型卷积层的输入图像分别经自注意力模块中的3个卷积层的卷积处理,分别获得第五自注意力图像、第六自注意力图像和第七自注意力图像。
第六特征图像的尺度与第五自注意力图像的尺度相同,第七特征图像的尺度与第六自注意力图像的尺度相同,第八特征图像的尺度与第七自注意力图像的尺度相同。通过将第五自注意力图像作为第六特征图像的权重,将第六自注意力图像作为第七特征图像的权重,将第七自注意力图像作为第八特征图像的权重,对第六特征图像、第七特征图像和第八特征图像进行融合处理,获得尺度感知型卷积层的输出图像。即将第五自注意力图像与第六特征图像进行点乘获得第九特征图像,将第六自注意力图像与第七特征图像进行点乘获得第十特征图像,将第七自注意力图像与第八特征图像进行点乘获得第十一特征图像。对第九特征图像、第十特征图像和第十一特征图像进行融合处理,获得尺度感知型卷积层的输出图像。可选的上述融合处理可以是将进行融合处理的两张图像中相同位置的像素点的像素值相加。
需要理解的是,图8所示的人群计数网络中网络层的具体数量仅为一个示例,不应对本申请构成限定。
在应用图8所示的人群计数网络对待处理图像执行人群计数任务之前,需对人群计数网络进行训练。为此,本申请还提供了一种人群计数网络的训练方法。该训练方法可包括以下步骤:获取样本图像。经人群计数网络对样本图像进行处理,获得第二人群密度图像。依据样本图像与第二人群密度图像之间的差异,获得网络损失。基于网络损失调整人群计数网络的参数。
上述样本图像可以是任意数字图像。例如,样本图像可以包含人物对象,其中,样本图像可以只包括人脸,并无躯干、四肢(下文将躯干和四肢称为人体),也可以只包括人体,不包括人脸,还可以只包括下肢或上肢。本申请对样本图像具体包含的人体区域不做限定。又例如,样本图像可以包含动物。再例如,样本图像可以包含植物。本申请对样本图像中包含的内容不做限定。
经人群计数网络对样本图像的处理获得与样本图像对应的第二人群密度图像后,可依据样本图像与第二人群密度图像之间的差异确定人群计数网络的网络损失。上述差异可以是样本图像与第二人群密度图像中相同位置的像素点的像素值之间的差异。本申请实施例中样本图像中像素点的像素值可用于表征像素点处是否有人物,例如,人物A在样本图像中所覆盖的图像区域包含像素点a,像素点b,像素点c,那么像素点a的像素值、像素点b的像素值和像素点c的像素值均为1。若样本图像中的像素点d不属于人物覆盖的图像区域,则像素点的像素值为0。
在确定人群计数网络的网络损失后,可基于该网络损失通过反向梯度传播的方式调整人群计数网络的参数,直至人群计数网络收敛,完成对人群计数网络的训练。
由于样本图像中的像素点的像素值非0即1,而第二人群密度图像中的像素点的像素值为大于或等于0且小于或等于1之间的数值。因此,依据用样本图像与第二人群密度图像之间的差异确定人群计数网络的网络损失存在较大的差异。
由于真实人群密度图像中像素点的像素值的取值范围也为大于或等于0且小于或等于1之间的数值,可选的,可将样本图像的真实人群密度图像作为监督信息,依据真实人群密度图像与第二人群密度图像之间的差异确定人群计数网络的网络损失,以提高获得的网 络损失的精度。
在一种可能实现的方式中,依据脉冲函数、高斯核以及样本图像,可获得上述样本图像的真实人群密度图像。
在该种可能实现的方式中,可依据冲击函数获得样本图像的人物标签图像,该人物标签图像中像素点的像素值用于表征像素点是否属于人物覆盖的图像区域。上述人物标签图像满足下式:
$H(x)=\sum_{i=1}^{N}\delta(x-x_i)$ …公式(3)
N为样本图像中的总人数。x i为人物覆盖的图像区域的中心在样本图像中的位置,用于表示该人物。δ(x-x i)为样本图像中人物覆盖的图像区域的中心在样本图像中的位置的冲击函数。若样本图像中的x处有人物,则δ(x)等于1,若样本图像中的x处没有人物,则δ(x)等于0。
使用高斯核对上述人物标签图像进行卷积处理,可获得样本图像的真实人群密度图像,该过程满足下式:
$D(x)=\sum_{i=1}^{N}\delta(x-x_i)*G_{\sigma_i}(x)$，其中，$\sigma_i=\beta d_i$ …公式(4)
$d_i=\frac{1}{m}\sum_{j=1}^{m}d_i^{\,j}$，$d_i^{\,j}$ 为第i个人物与距其最近的第j个人物之间的距离。上述 $G_{\sigma_i}(x)$
为高斯核,σ i为该高斯核的标准差。β为正数。d i为距离人物x i最近的m个人物与x i之间的距离的平均值。显然,d i越大,与d i对应的人物覆盖的图像区域的人群密度也越大。由于样本图像中远处的人物的d i比近处的人物的d i小,通过使高斯核的标准差满足σ i=βd i,可使高斯核的标准差与人物覆盖的图像区域的尺度呈正相关,即样本图像中不同图像区域对应的高斯核的标准差不同。这样,通过使用高斯核对样本图像进行卷积处理获得的真实人群密度图像的精确度更高。
举例来说,公式(3)中的x i为样本图像中人物的头部覆盖的图像区域的中心(下文将称为人头区域的中心)在样本图像中的位置,δ(x-x i)为样本图像中人头区域的中心的位置的冲击函数。若样本图像中的x处有人头,则δ(x)等于1,若样本图像中的x处没有人头,则δ(x)等于0。基于公式(4)使用高斯核对上述人物标签图像进行卷积处理,得到样本图像的真实人群密度图像。对人物标签图像中的第i个人头进行卷积处理所使用的高斯核的标准差满足σ i=βd i,其中,d i为人物标签图像中的第i个人头的中心与m个目标人头的中心(此处的目标人头为人物标签图像中距离第i个人头最近的人头)之间的平均距离,通常情况头部的大小与两个相邻的人在拥挤的场景中的中心之间的距离有关,d i在人群较密的情况下近似等于人头大小。由于人物标签图像中“近”处的人头覆盖的图像区域的面积比“远”出的人头覆盖的图像区域的面积大,也就是说,人物标签图像中“近”处的两个人头的中心之间的距离比“远”出的两个人头的中心之间的距离大,通过使高斯核的标准差满足σ i=βd i,可达到使高斯核的标准差与人物的头部覆盖的图像区域的尺度呈正相关的效果。
在获得样本图像的真实人群密度图像后,可依据真实人群密度图像中与第二人群密度图像中相同位置的像素点的像素值之间的差异,确定人群计数网络的网络损失。例如将真实人群密度图像中与第二人群密度图像中所有的相同位置的像素点的像素值之间的差异的和作为人群计数网络的网络损失。
可选的,在将样本图像输入至人群计数网络之前,可对样本图像进行预处理,获得至少一张预处理后的图像,并将上述至少一张预处理后的图像作为训练数据输入至人群计数网络。这样,可达到扩充人群计数网络的训练数据集的效果。
上述预处理包括从样本图像中截取预定尺寸的图像、对样本图像或所述预定尺寸的图像进行翻转处理中的至少一种。其中,预定大小可以为64*64。对样本图像进行翻转处理 包括:水平镜面翻转处理。
例如,分别沿样本图像的水平中轴线和竖直中轴线对样本图像进行划分,可获得4张预处理后的图像。同时从样本图像中随机截取5张预定尺寸的图像,可获得5张预处理后的图像。至此,已获得9张预处理后的图像。对该9张预处理后的图像进行水平镜面翻转处理,可获得9张翻转后的图像,即另外9张预处理后的图像。这样即可获得18张预处理后的图像。
通过将至少一张预处理后的图像输入至人群计数网络,可获得至少一张第三人群密度图像,其中,每一张预处理后的图像均对应有一张第三人群密度图像。例如(例2),将图像A、图像B、图像C这3张预处理后的图像分别输入至人群计数网络,将分别获得与图像A对应的人群密度图像a,与图像B对应的人群密度图像b,图像C对应的人群密度图像c。其中,人群密度图像a、人群密度图像b、人群密度图像c均可称为第三人群密度图像。
依据至少一张预处理后的图像中的目标图像和与目标图像对应的第三人群密度图像之间的差异,可获得人群计数网络的网络损失。接着例2继续举例,依据图像A与图像a之间的差异可获得第一差异,依据图像B与图像b之间的差异可获得第二差异,依据图像C与图像c之间的差异可获得第三差异。对第一差异、第二差异和第三差异求和可获得人群计数网络的网络损失。
本实施例提供了一种人群计数网络,使用该人群计数网络对待处理图像进行处理,可获得与待处理图像对应的人群密度图像,进而可确定待处理图像中的人数。
基于本申请实施例提供的技术方案,本申请实施例还提供了几种可能实现的应用场景:
场景A:如上所述,在公共场所常因人流量过多导致人群过于密集的情况的发生,进而发生一些公共事故,如何对公共场所进行人群计数就具有非常大的意义。
目前,为了增强工作、生活或者社会环境中的安全性,会在各个公共场所内安装监控摄像设备,以便根据视频流信息进行安全防护。利用本申请实施例提供的技术方案对监控摄像设备采集到的视频流进行处理,可确定公共场所的人数,进而可有效预防公共事故的发生。
举例来说,监控摄像设备的视频流处理中心的服务器可执行本申请实施例提供的技术方案,该服务器可与至少一个监控摄像头相连。服务器在获取到监控摄像头发送的视频流后,可采用本申请实施例提供的技术方案对视频流中的每一帧图像进行处理,以确定视频流中的每一帧图像中的人数。在图像中的人数大于或等于人数阈值的情况下,服务器可向相关设备发送指令,以进行提示或报警。例如,服务器可向采集该图像的摄像头发送指令,该指令用于指示采集该图像的摄像头进行报警。又例如,服务器可向采集该图像的摄像头所在的区域的管控人员的终端发送指令,该指令用于提示该终端输出人数超过人数阈值的提示信息。
场景B:商场中不同区域的人流量不同,将主推商品放置于人流量多的区域进行展示可有效提高主推商品的销量,因此,如何准确确定商场不同区域的人流量对商家来说具有非常重要的意义。例如,商场中有区域A、区域B和区域C,其中区域B的人流量最大,基于此,商家可将主推商品放置于区域B进行展示,以提高主推商品的销量。
商场的监控摄像头的视频流的管控中心的服务器可执行本申请实施例提供的技术方案,该服务器可与至少一个监控摄像头相连。服务器在获取到监控摄像头发送的视频流后,可采用本申请实施例提供的技术方案对视频流中的每一帧图像进行处理,以确定视频流中的每一帧图像中的人数。依据每一帧图像中的人数可确定不同摄像头监控的区域在某一时间段内的人流量,进而可确定商场内的不同区域的人流量。例如,商场中有区域A、区域B、区域C,摄像头A、摄像头B和摄像头C,其中,摄像头A监控区域A,摄像头B监控区域B,摄像头C监控区域C。服务器使用本申请实施例提供的技术方案对摄像头A采集到的视频流中的图像进行处理,确定区域A在过去一个星期内平均每天的人流量为900,确定区域B在过去一个星期内平均每天的人流量为200,确定区域C在过去一个星期内平均每天的人流量为600。显然,区域A的人流量最多,因此商家可将主推商品放置于区域A内进行展示,以提高主推商品的销量。
本领域技术人员可以理解,在具体实施方式的上述方法中,各步骤的撰写顺序并不意味着严格的执行顺序而对实施过程构成任何限定,各步骤的具体执行顺序应当以其功能和可能的内在逻辑确定。
上述详细阐述了本申请实施例的方法,下面提供了本申请实施例的装置。
请参阅图10,图10为本申请实施例提供的一种图像处理装置的结构示意图,该装置1包括:获取单元11、卷积处理单元12、融合处理单元13、特征提取处理单元14、第一确定单元15、第二确定单元16以及训练单元17。其中:
获取单元11,用于获取待处理图像、第一卷积核和第二卷积核,所述第一卷积核的感受野与所述第二卷积核的感受野不同;
卷积处理单元12,用于使用所述第一卷积核对所述待处理图像进行卷积处理获得第一特征图像,使用所述第二卷积核对所述待处理图像进行卷积处理获得第二特征图像;
融合处理单元13,用于对所述第一特征图像和所述第二特征图像进行融合处理,获得第一人群密度图像。
在一种可能实现的方式中,所述装置1还包括:
特征提取处理单元14,用于在所述对所述第一特征图像和所述第二特征图像进行融合处理,获得第一人群密度图像之前,对所述待处理图像进行第一特征提取处理,获得第一自注意力图像,对所述待处理图像进行第二特征提取处理,获得第二自注意力图像,所述第一自注意力图像和所述第二自注意力图像均用于表征所述待处理图像的尺度信息,且所述第一自注意力图像所表征的尺度信息与所述第二自注意力图像所表征的尺度信息不同;
第一确定单元15,用于依据所述第一自注意力图像确定所述第一特征图像的第一权重,依据所述第二自注意力图像确定所述第二特征图像的第二权重;
所述融合处理单元13用于:
依据所述第一权重和所述第二权重对所述第一特征图像和所述第二特征图像进行融合处理,获得所述第一人群密度图像。
在另一种可能实现的方式中,所述融合处理单元13具体用于:
确定所述第一权重与所述第一特征图像之间的点积,获得第三特征图像;
确定所述第二权重与所述第二特征图像之间的点积,获得第四特征图像;
对所述第三特征图像和所述第四特征图像进行融合处理,获得所述第一人群密度图像。
在又一种可能实现的方式中,所述第一确定单元15用于:
对所述第一自注意力图像和所述第二自注意力图像进行归一化处理,获得所述第一自注意力图像对应的第三自注意力图像和所述第二自注意力图像对应的第四自注意力图像;
将所述第三自注意力图像作为所述第一权重,将所述第四自注意力图像作为所述第二权重。
在又一种可能实现的方式中,所述特征提取处理单元14,还用于在所述使用所述第一卷积核对所述待处理图像进行卷积处理获得第一特征图像,使用所述第二卷积核对所述待处理图像进行卷积处理获得第二特征图像之前,对所述待处理图像进行第三特征提取处理,获得第五特征图像;
所述卷积处理单元12用于:
使用所述第一卷积核对所述第五特征图像进行卷积处理获得所述第一特征图像,使用所述第二卷积核对所述第五特征图像进行卷积处理获得所述第二特征图像;
所述特征提取处理单元14还用于:
对所述第五特征图像进行所述第一特征提取处理,获得所述第一自注意力图像,对所述第五特征图像进行所述第二特征提取处理,获得所述第二自注意力图像。
在又一种可能实现的方式中,所述第一卷积核和所述第二卷积核均为空洞卷积核,且所述第一卷积核的大小与所述第二卷积核的大小相同,且所述第一卷积核的权重与所述第二卷积核的权重相同,且所述第一卷积核的扩张率与所述第二卷积核的扩张率不同。
在又一种可能实现的方式中,所述第一卷积核或所述第二卷积核的扩张率为参考值。
在又一种可能实现的方式中,所述装置1还包括:第二确定单元16,用于确定所述第一人群密度图像中的像素值的和,获得所述待处理图像中的人数。
在又一种可能实现的方式中,所述装置1执行的图像处理方法应用于人群计数网络;
所述装置1还包括:训练单元17,用于对所述人群计数网络进行训练,所述人群计数网络的训练过程包括:
获取样本图像;
使用所述人群计数网络对所述样本图像进行处理,获得第二人群密度图像;
依据所述样本图像与所述第二人群密度图像之间的差异,获得网络损失;
基于所述网络损失调整所述人群计数网络的参数。
在又一种可能实现的方式中,所述训练单元17还用于:
在所述依据所述样本图像与所述第二人群密度图像之间的差异,获得网络损失之前,依据冲击函数、高斯核以及所述样本图像,获得所述样本图像的真实人群密度图像;
依据所述真实人群密度图像与所述第二人群密度图像之间的差异,获得所述网络损失。
在又一种可能实现的方式中,所述训练单元17还用于:
在所述经所述人群计数网络对所述样本图像进行处理,获得第二人群密度图像之前,对所述样本图像进行预处理,获得至少一张预处理后的图像;
使用所述人群计数网络对所述至少一张预处理后的图像进行处理,获得至少一张第三人群密度图像,所述预处理后的图像与所述第三人群密度图像一一对应;
依据所述至少一张预处理后的图像中的目标图像和与所述目标图像对应的第三人群密度图像之间的差异,获得所述网络损失。
在又一种可能实现的方式中,所述预处理包括:从所述样本图像中截取预定尺寸的图像、对所述样本图像或所述预定尺寸的图像进行翻转处理中的至少一种。
本实施例通过使用感受野不同的第一卷积核和第二卷积核分别对待处理图像进行卷积处理,以提取出不同尺度下的描述待处理图像的内容的信息,分别获得第一特征图像和第二特征图像。通过对第一特征图像和第二特征图像进行融合处理,以利用不同尺度下的描述待处理图像的内容的信息,提高获得的与待处理图像对应的人群密度图像的精度,进而提升获得的待处理图像中人数的精度。
在一些实施例中,本公开实施例提供的装置具有的功能或包含的模块可以用于执行上文方法实施例描述的方法,其具体实现可以参照上文方法实施例的描述,为了简洁,这里不再赘述。
图11为本申请实施例提供的一种图像处理装置的硬件结构示意图。该图像处理装置2包括处理器21,存储器22,还可以包括输入装置23,输出装置24。该处理器21、存储器22、输入装置23和输出装置24通过连接器相耦合,该连接器包括各类接口、传输线或总线等等,本申请实施例对此不作限定。应当理解,本申请的各个实施例中,耦合是指通过特定方式的相互联系,包括直接相连或者通过其他设备间接相连,例如可以通过各类接口、传输线、总线等相连。
处理器21可以是一个或多个图形处理器(graphics processing unit,GPU),在处理器21是一个GPU的情况下,该GPU可以是单核GPU,也可以是多核GPU。可选的,处理器21可以是多个GPU构成的处理器组,多个处理器之间通过一个或多个总线彼此耦合。可选的,该处理器还可以为其他类型的处理器等等,本申请实施例不作限定。
存储器22可用于存储计算机程序指令,以及用于执行本申请方案的程序代码在内的各类计算机程序代码。可选地,存储器包括但不限于是随机存储记忆体(random access memory,RAM)、只读存储器(read-only memory,ROM)、可擦除可编程只读存储器(erasable programmable read only memory,EPROM)、或便携式只读存储器(compact disc read-only memory,CD-ROM),该存储器用于相关指令及数据。
输入装置23用于输入数据和信号,以及输出装置24用于输出数据和信号。输入装置23和输出装置24可以是独立的器件,也可以是一个整体的器件。
可理解,本申请实施例中,存储器22不仅可用于存储相关指令,还可用于存储相关图像,如该存储器22可用于存储通过输入装置23获取的待处理图像,又或者该存储器22还可用于存储通过处理器21获得的第一人群密度图像等等,本申请实施例对于该存储器中具体所存储的数据不作限定。
可以理解的是,图11仅仅示出了图像处理装置的简化设计。在实际应用中,图像处理装置还可以分别包含必要的其他元件,包含但不限于任意数量的输入/输出装置、处理器、存储器等,而所有可以实现本申请实施例的图像处理装置都在本申请的保护范围之内。
本申请实施例还提供了一种处理器,该处理器的缓存中可存储计算机程序,当该计算机程序被该处理器执行时,该处理器可执行实施例(一)和实施例(二)所提供的技术方案、或实现已训练的人群计数网络对待处理图像的处理。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。所属领域的技术人员还可以清楚地了解到,本申请各个实施例描述各有侧重,为描述的方便和简洁,相同或类似的部分在不同实施例中可能没有赘述,因此,在某一实施例未描述或未详细描述的部分可以参见其他实施例的记载。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者通过所述计算机可读存储介质进行传输。所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(digital subscriber line,DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如,数字通用光盘(digital versatile disc,DVD))、或者半导体介质(例如固态硬盘(solid state disk,SSD))等。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,该流程可以由计算机程序来指令相关的硬件完成,该程序可存储于易失性和非易失性计算机可读取存储介质中,该程序在执行时,可包括如上述各方法实施例的流程。而前述的存储介质包括:只读存储器(read-only memory,ROM)或随机存储存储器(random access memory,RAM)、磁碟或者光盘等各种可存储程序代码的介质。

Claims (28)

  1. 一种图像处理方法,其特征在于,所述方法包括:
    获取待处理图像、第一卷积核和第二卷积核,所述第一卷积核的感受野与所述第二卷积核的感受野不同;
    使用所述第一卷积核对所述待处理图像进行卷积处理获得第一特征图像,使用所述第二卷积核对所述待处理图像进行卷积处理获得第二特征图像;
    对所述第一特征图像和所述第二特征图像进行融合处理,获得第一人群密度图像。
  2. 根据权利要求1所述的方法,其特征在于,在所述对所述第一特征图像和所述第二特征图像进行融合处理,获得第一人群密度图像之前,所述方法还包括:
    对所述待处理图像进行第一特征提取处理,获得第一自注意力图像,对所述待处理图像进行第二特征提取处理,获得第二自注意力图像,所述第一自注意力图像和所述第二自注意力图像均用于表征所述待处理图像的尺度信息,且所述第一自注意力图像所表征的尺度信息与所述第二自注意力图像所表征的尺度信息不同;
    依据所述第一自注意力图像确定所述第一特征图像的第一权重,依据所述第二自注意力图像确定所述第二特征图像的第二权重;
    所述对所述第一特征图像和所述第二特征图像进行融合处理,获得第一人群密度图像,包括:
    依据所述第一权重和所述第二权重对所述第一特征图像和所述第二特征图像进行融合处理,获得所述第一人群密度图像。
  3. 根据权利要求2所述的方法,其特征在于,所述依据所述第一权重和所述第二权重对所述第一特征图像和所述第二特征图像进行融合处理,获得所述第一人群密度图像,包括:
    确定所述第一权重与所述第一特征图像之间的点积,获得第三特征图像;
    确定所述第二权重与所述第二特征图像之间的点积,获得第四特征图像;
    对所述第三特征图像和所述第四特征图像进行融合处理,获得所述第一人群密度图像。
  4. 根据权利要求2或3所述的方法,其特征在于,所述依据所述第一自注意力图像确定所述第一特征图像的第一权重,依据所述第二自注意力图像确定所述第二特征图像的第二权重,包括:
    对所述第一自注意力图像和所述第二自注意力图像进行归一化处理,获得所述第一自注意力图像对应的第三自注意力图像和所述第二自注意力图像对应的第四自注意力图像;
    将所述第三自注意力图像作为所述第一权重,将所述第四自注意力图像作为所述第二权重。
  5. 根据权利要求2至4中任意一项所述的方法,其特征在于,在所述使用所述第一卷积核对所述待处理图像进行卷积处理获得第一特征图像,使用所述第二卷积核对所述待处理图像进行卷积处理获得第二特征图像之前,所述方法还包括:
    对所述待处理图像进行第三特征提取处理,获得第五特征图像;
    所述使用所述第一卷积核对所述待处理图像进行卷积处理获得第一特征图像,使用所述第二卷积核对所述待处理图像进行卷积处理获得第二特征图像,包括:
    使用所述第一卷积核对所述第五特征图像进行卷积处理获得所述第一特征图像,使用所述第二卷积核对所述第五特征图像进行卷积处理获得所述第二特征图像;
    所述对所述待处理图像进行第一特征提取处理,获得第一自注意力图像,对所述待处理图像进行第二特征提取处理,获得第二自注意力图像,包括:
    对所述第五特征图像进行所述第一特征提取处理,获得所述第一自注意力图像,对所述第五特征图像进行所述第二特征提取处理,获得所述第二自注意力图像。
  6. 根据权利要求1至5中任意一项所述的方法,其特征在于,所述第一卷积核和所述第二卷积核均为空洞卷积核,且所述第一卷积核的大小与所述第二卷积核的大小相同,且所述第一卷积核的权重与所述第二卷积核的权重相同,且所述第一卷积核的扩张率与所述第二卷积核的扩张率不同。
  7. 根据权利要求6所述的方法,其特征在于,所述第一卷积核或所述第二卷积核的扩张率为参考值。
  8. 根据权利要求1至7中任意一项所述的方法,其特征在于,所述方法还包括:确定所述第一人群密度图像中的像素值的和,获得所述待处理图像中的人数。
  9. 根据权利要求1至8中任意一项所述的方法,其特征在于,所述方法应用于人群计数网络;
    所述人群计数网络的训练过程包括:
    获取样本图像;
    使用所述人群计数网络对所述样本图像进行处理,获得第二人群密度图像;
    依据所述样本图像与所述第二人群密度图像之间的差异,获得网络损失;
    基于所述网络损失调整所述人群计数网络的参数。
  10. 根据权利要求9所述的方法,其特征在于,在所述依据所述样本图像与所述第二人群密度图像之间的差异,获得网络损失之前,所述方法还包括:
    获得所述样本图像的真实人群密度图像;
    所述依据所述样本图像与所述第二人群密度图像之间的差异,获得网络损失,包括:
    依据所述真实人群密度图像与所述第二人群密度图像之间的差异,获得所述网络损失。
  11. 根据权利要求9所述的方法,其特征在于,在所述经所述人群计数网络对所述样本图像进行处理,获得第二人群密度图像之前,所述方法还包括:
    对所述样本图像进行预处理,获得至少一张预处理后的图像;
    所述经所述人群计数网络对所述样本图像进行处理,获得第二人群密度图像,包括:
    使用所述人群计数网络对所述至少一张预处理后的图像进行处理,获得至少一张第三人群密度图像,所述预处理后的图像与所述第三人群密度图像一一对应;
    所述依据所述样本图像与所述第二人群密度图像之间的差异,获得网络损失,包括:
    依据所述至少一张预处理后的图像中的目标图像和与所述目标图像对应的第三人群密度图像之间的差异,获得所述网络损失。
  12. 根据权利要求11所述的方法,其特征在于,所述预处理包括:从所述样本图像中截取预定尺寸的图像、对所述样本图像或所述预定尺寸的图像进行翻转处理中的至少一种。
  13. 一种图像处理装置,其特征在于,所述装置包括:
    获取单元,用于获取待处理图像、第一卷积核和第二卷积核,所述第一卷积核的感受野与所述第二卷积核的感受野不同;
    卷积处理单元,用于使用所述第一卷积核对所述待处理图像进行卷积处理获得第一特征图像,使用所述第二卷积核对所述待处理图像进行卷积处理获得第二特征图像;
    融合处理单元,用于对所述第一特征图像和所述第二特征图像进行融合处理,获得第一人群密度图像。
  14. 根据权利要求13所述的装置,其特征在于,所述装置还包括:
    特征提取处理单元,用于在所述对所述第一特征图像和所述第二特征图像进行融合处理,获得第一人群密度图像之前,对所述待处理图像进行第一特征提取处理,获得第一自注意力图像,对所述待处理图像进行第二特征提取处理,获得第二自注意力图像,所述第一自注意力图像和所述第二自注意力图像均用于表征所述待处理图像的尺度信息,且所述第一自注意力图像所表征的尺度信息与所述第二自注意力图像所表征的尺度信息不同;
    第一确定单元,用于依据所述第一自注意力图像确定所述第一特征图像的第一权重,依据所述第二自注意力图像确定所述第二特征图像的第二权重;
    所述融合处理单元用于:
    依据所述第一权重和所述第二权重对所述第一特征图像和所述第二特征图像进行融合处理,获得所述第一人群密度图像。
  15. 根据权利要求14所述的装置,其特征在于,所述融合处理单元具体用于:
    确定所述第一权重与所述第一特征图像之间的点积,获得第三特征图像;
    确定所述第二权重与所述第二特征图像之间的点积,获得第四特征图像;
    对所述第三特征图像和所述第四特征图像进行融合处理,获得所述第一人群密度图像。
  16. 根据权利要求14或15所述的装置,其特征在于,所述第一确定单元用于:
    对所述第一自注意力图像和所述第二自注意力图像进行归一化处理,获得所述第一自注意力图像对应的第三自注意力图像和所述第二自注意力图像对应的第四自注意力图像;
    将所述第三自注意力图像作为所述第一权重,将所述第四自注意力图像作为所述第二权重。
  17. 根据权利要求14至16中任意一项所述的装置,其特征在于,所述特征提取处理单元,还用于在所述使用所述第一卷积核对所述待处理图像进行卷积处理获得第一特征图像,使用所述第二卷积核对所述待处理图像进行卷积处理获得第二特征图像之前,对所述待处理图像进行第三特征提取处理,获得第五特征图像;
    所述卷积处理单元用于:
    使用所述第一卷积核对所述第五特征图像进行卷积处理获得所述第一特征图像,使用所述第二卷积核对所述第五特征图像进行卷积处理获得所述第二特征图像;
    所述特征提取处理单元还用于:
    对所述第五特征图像进行所述第一特征提取处理,获得所述第一自注意力图像,对所述第五特征图像进行所述第二特征提取处理,获得所述第二自注意力图像。
  18. 根据权利要求13至17中任意一项所述的装置,其特征在于,所述第一卷积核和所述第二卷积核均为空洞卷积核,且所述第一卷积核的大小与所述第二卷积核的大小相同,且所述第一卷积核的权重与所述第二卷积核的权重相同,且所述第一卷积核的扩张率与所述第二卷积核的扩张率不同。
  19. 根据权利要求18所述的装置,其特征在于,所述第一卷积核或所述第二卷积核的扩张率为参考值。
  20. 根据权利要求13至19中任意一项所述的装置,其特征在于,所述装置还包括:第二确定单元,用于确定所述第一人群密度图像中的像素值的和,获得所述待处理图像中的人数。
  21. 根据权利要求12至20中任意一项所述的装置,其特征在于,所述装置执行的图像处理方法应用于人群计数网络;
    所述装置还包括:训练单元,用于对所述人群计数网络进行训练,所述人群计数网络的训练过程包括:
    获取样本图像;
    使用所述人群计数网络对所述样本图像进行处理,获得第二人群密度图像;
    依据所述样本图像与所述第二人群密度图像之间的差异,获得网络损失;
    基于所述网络损失调整所述人群计数网络的参数。
  22. 根据权利要求21所述的装置,其特征在于,所述训练单元还用于:
    在所述依据所述样本图像与所述第二人群密度图像之间的差异,获得网络损失之前,依据冲击函数、高斯核以及所述样本图像,获得所述样本图像的真实人群密度图像;
    依据所述真实人群密度图像与所述第二人群密度图像之间的差异,获得所述网络损失。
  23. 根据权利要求21所述的装置,其特征在于,所述训练单元还用于:
    在所述经所述人群计数网络对所述样本图像进行处理,获得第二人群密度图像之前,对所述样本图像进行预处理,获得至少一张预处理后的图像;
    使用所述人群计数网络对所述至少一张预处理后的图像进行处理,获得至少一张第三人群密度图像,所述预处理后的图像与所述第三人群密度图像一一对应;
    依据所述至少一张预处理后的图像中的目标图像和与所述目标图像对应的第三人群密度图像之间的差异,获得所述网络损失。
  24. 根据权利要求23所述的装置,其特征在于,所述预处理包括:从所述样本图像中截取预定尺寸的图像、对所述样本图像或所述预定尺寸的图像进行翻转处理中的至少一种。
  25. 一种处理器,其特征在于,所述处理器用于执行如权利要求1至12中任意一项所述的方法。
  26. 一种电子设备,其特征在于,包括:相互连接的处理器和存储器,所述存储器用于存储计算机程序代码,所述计算机程序代码包括计算机指令,当所述处理器执行所述计算机指令时,所述电子设备执行如权利要求1至12中任一项所述的方法。
  27. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质中存储有计算机程序,所述计算机程序包括程序指令,所述程序指令当被电子设备的处理器执行时,使所述处理器执行权利要求1至12中任意一项所述的方法。
  28. 一种包含指令的计算机程序产品,其特征在于,当所述计算机程序产品在计算机上运行时,使得计算机执行权利要求1至12中任意一项所述的方法。
PCT/CN2019/125297 2019-11-27 2019-12-13 图像处理方法及装置、处理器、电子设备、存储介质 WO2021103187A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
SG11202106680UA SG11202106680UA (en) 2019-11-27 2019-12-13 Method and device for image processing, processor, electronic equipment and storage medium
KR1020217013985A KR20210075140A (ko) 2019-11-27 2019-12-13 이미지 처리 방법 및 장치, 프로세서, 전자 기기, 저장 매체
JP2021521482A JP2022516398A (ja) 2019-11-27 2019-12-13 画像処理方法及び画像処理装置、プロセッサ、電子機器並びに記憶媒体
US17/348,878 US20210312192A1 (en) 2019-11-27 2021-06-16 Method and device for image processing and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911182723.7 2019-11-27
CN201911182723.7A CN110956122B (zh) 2019-11-27 2019-11-27 图像处理方法及装置、处理器、电子设备、存储介质

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/348,878 Continuation US20210312192A1 (en) 2019-11-27 2021-06-16 Method and device for image processing and storage medium

Publications (1)

Publication Number Publication Date
WO2021103187A1 true WO2021103187A1 (zh) 2021-06-03

Family

ID=69978585

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/125297 WO2021103187A1 (zh) 2019-11-27 2019-12-13 图像处理方法及装置、处理器、电子设备、存储介质

Country Status (7)

Country Link
US (1) US20210312192A1 (zh)
JP (1) JP2022516398A (zh)
KR (1) KR20210075140A (zh)
CN (1) CN110956122B (zh)
SG (1) SG11202106680UA (zh)
TW (1) TWI752466B (zh)
WO (1) WO2021103187A1 (zh)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111639523B (zh) * 2020-04-17 2023-07-07 北京迈格威科技有限公司 Target detection method and apparatus, computer device, and storage medium
CN111652152A (zh) * 2020-06-04 2020-09-11 上海眼控科技股份有限公司 Crowd density detection method and apparatus, computer device, and storage medium
CN111652161A (zh) * 2020-06-08 2020-09-11 上海商汤智能科技有限公司 Crowd overcrowding prediction method and apparatus, electronic device, and storage medium
CN112115900B (zh) * 2020-09-24 2024-04-30 腾讯科技(深圳)有限公司 Image processing method and apparatus, device, and storage medium
CN112434607B (zh) * 2020-11-24 2023-05-26 北京奇艺世纪科技有限公司 Feature processing method and apparatus, electronic device, and computer-readable storage medium
CN113887615A (zh) * 2021-09-29 2022-01-04 北京百度网讯科技有限公司 Image processing method, apparatus, device, and medium
CN115115554B (zh) * 2022-08-30 2022-11-04 腾讯科技(深圳)有限公司 Image processing method and apparatus based on enhanced images, and computer device
CN116363598A (zh) * 2023-05-29 2023-06-30 深圳市捷易科技有限公司 Crowd congestion early-warning method and apparatus, electronic device, and readable storage medium

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109313627A (zh) * 2016-03-17 2019-02-05 映佳控制公司 Method and system for processing a task, robust to missing input information
CN107784654B (zh) * 2016-08-26 2020-09-25 杭州海康威视数字技术股份有限公司 Image segmentation method and apparatus, and fully convolutional network system
US10402527B2 (en) * 2017-01-04 2019-09-03 Stmicroelectronics S.R.L. Reconfigurable interconnect
CN108229455B (zh) * 2017-02-23 2020-10-16 北京市商汤科技开发有限公司 Object detection method, neural network training method, apparatus, and electronic device
CN106934397B (zh) * 2017-03-13 2020-09-01 北京市商汤科技开发有限公司 Image processing method, apparatus, and electronic device
WO2018224442A1 (en) * 2017-06-05 2018-12-13 Siemens Aktiengesellschaft Method and apparatus for analysing an image
CN107301387A (zh) * 2017-06-16 2017-10-27 华南理工大学 Deep-learning-based method for counting high-density crowds in images
TWI667621B (zh) * 2018-04-09 2019-08-01 和碩聯合科技股份有限公司 Face recognition method
CN108681743B (zh) * 2018-04-16 2019-12-06 腾讯科技(深圳)有限公司 Image object recognition method and apparatus, storage medium
CN109858461B (zh) * 2019-02-21 2023-06-16 苏州大学 Dense crowd counting method, apparatus, device, and storage medium
CN110245659B (zh) * 2019-05-21 2021-08-13 北京航空航天大学 Image salient object segmentation method and apparatus based on foreground-background interrelation
CN110348537B (zh) * 2019-07-18 2022-11-29 北京市商汤科技开发有限公司 Image processing method and apparatus, electronic device, and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160328630A1 (en) * 2015-05-08 2016-11-10 Samsung Electronics Co., Ltd. Object recognition apparatus and method
CN109241895A (zh) * 2018-08-28 2019-01-18 北京航空航天大学 Dense crowd counting method and device
CN109872364A (zh) * 2019-01-28 2019-06-11 腾讯科技(深圳)有限公司 Image region positioning method and apparatus, storage medium, and medical image processing device
CN110020606A (zh) * 2019-03-13 2019-07-16 北京工业大学 Crowd density estimation method based on a multi-scale convolutional neural network
CN110135325A (zh) * 2019-05-10 2019-08-16 山东大学 Crowd counting method and system based on a scale-adaptive network

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117021435A (zh) * 2023-05-12 2023-11-10 浙江闽立电动工具有限公司 Trimming control system and method of an edge trimmer
CN117021435B (zh) * 2023-05-12 2024-03-26 浙江闽立电动工具有限公司 Trimming control system and method of an edge trimmer

Also Published As

Publication number Publication date
TW202121233A (zh) 2021-06-01
CN110956122B (zh) 2022-08-02
SG11202106680UA (en) 2021-07-29
KR20210075140A (ko) 2021-06-22
TWI752466B (zh) 2022-01-11
US20210312192A1 (en) 2021-10-07
CN110956122A (zh) 2020-04-03
JP2022516398A (ja) 2022-02-28

Similar Documents

Publication Publication Date Title
WO2021103187A1 (zh) Image processing method and device, processor, electronic device, storage medium
Salama AbdELminaam et al. A deep facial recognition system using computational intelligent algorithms
WO2020199931A1 (zh) Face keypoint detection method and apparatus, storage medium, and electronic device
US11238272B2 (en) Method and apparatus for detecting face image
WO2020177673A1 (zh) Video sequence selection method, computer device, and storage medium
WO2021063056A1 (zh) Face attribute recognition method and apparatus, electronic device, and storage medium
CN107679466B (zh) Information output method and apparatus
CN107636684A (zh) Emotion recognition in video conferencing
WO2022041830A1 (zh) Pedestrian re-identification method and apparatus
CN107679447A (zh) Facial feature point detection method and apparatus, and storage medium
CN108197592B (zh) Information acquisition method and apparatus
WO2021164550A1 (zh) Image classification method and apparatus
WO2021051547A1 (zh) Violent behavior detection method and system
US20210117687A1 (en) Image processing method, image processing device, and storage medium
US20210012201A1 (en) Center-biased machine learning techniques to determine saliency in digital images
US10133955B2 (en) Systems and methods for object recognition based on human visual pathway
WO2023173646A1 (zh) Expression recognition method and apparatus
WO2021169641A1 (zh) Face recognition method and system
US12008793B2 (en) Object behavior analysis method, information display method, and electronic device
CN109033935B (zh) Forehead wrinkle detection method and apparatus
JP2020013553A (ja) Information generation method and apparatus applied to a terminal device
WO2021223738A1 (zh) Model parameter update method, apparatus, device, and storage medium
WO2022111387A1 (zh) Data processing method and related apparatus
WO2023087420A1 (zh) Apron human action recognition method and system based on thermal infrared vision
CN115205780A (zh) Construction site violation monitoring method, system, medium, and electronic device

Legal Events

Date Code Title Description
ENP  Entry into the national phase (ref document number: 2021521482; country of ref document: JP; kind code of ref document: A)
ENP  Entry into the national phase (ref document number: 20217013985; country of ref document: KR; kind code of ref document: A)
121  Ep: the epo has been informed by wipo that ep was designated in this application (ref document number: 19954423; country of ref document: EP; kind code of ref document: A1)
NENP Non-entry into the national phase (ref country code: DE)
122  Ep: pct application non-entry in european phase (ref document number: 19954423; country of ref document: EP; kind code of ref document: A1)
32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established (free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 27.09.2022))
WWE  Wipo information: entry into national phase (ref document number: 521422585; country of ref document: SA)