CN117830935A

CN117830935A - Crowd counting model training method and device based on self-adaptive region selection module

Info

Publication number: CN117830935A
Application number: CN202311725292.0A
Authority: CN
Inventors: 石雅洁; 蒋召
Original assignee: Shenzhen Xumi Yuntu Space Technology Co Ltd
Current assignee: Shenzhen Xumi Yuntu Space Technology Co Ltd
Priority date: 2023-12-14
Filing date: 2023-12-14
Publication date: 2024-04-05

Abstract

The application relates to the technical field of computer vision and provides a crowd counting model training method and device based on an adaptive region selection module. The method comprises the following steps: extracting features of the object pictures in the training set by using a feature extraction network to obtain an original feature map; in the convolutional neural network branch, transforming the original feature map by using a convolution module to obtain a deep feature map transformed by the convolution module; in the conversion branch, transforming the original feature map by using a preset conversion module to obtain a deep feature map transformed by the preset conversion module; inputting the deep feature maps of the two branches into the adaptive region selection modules of the corresponding branches for adaptive processing to obtain a first weighted feature map and a second weighted feature map; obtaining an output result from the two weighted feature maps; and finally calculating a loss function and updating the network by back-propagation to train the crowd counting model. The embodiment of the application solves the problem of poor crowd counting performance in complex scenes.

Description

Crowd counting model training method and device based on self-adaptive region selection module

Technical Field

The application relates to the technical field of computer vision, in particular to a crowd counting model training method and device based on a self-adaptive region selection module.

Background

Crowd counting is widely applied in fields such as video surveillance and public safety, where it allows crowd congestion to be monitored in real time. With the development of deep learning, the accuracy of crowd counting has steadily improved. Today, crowd counting algorithms based on common convolutional neural networks perceive crowd density by enlarging different local areas and then performing the prediction task independently in each area. This allows crowd density to be predicted to a certain extent, but the algorithms are highly complex, struggle to predict crowd density accurately in complex crowd scenes, and their counting performance degrades under conditions such as overlapping crowds, occlusion, and dense crowds.

Therefore, the prior art has the problem of poor crowd counting effect in complex scenes.

Disclosure of Invention

In view of this, the embodiment of the application provides a crowd counting model training method and device based on an adaptive region selection module, so as to solve the problem of poor crowd counting effect in a complex scene in the prior art.

In a first aspect of the embodiments of the present application, a crowd counting model training method based on an adaptive region selection module is provided, including: extracting features of the object pictures in the training set by using a feature extraction network to obtain an original feature map; inputting the original feature map into a convolutional neural network branch and a conversion branch respectively; in the convolutional neural network branch, transforming the original feature map by using a convolution module to obtain a deep feature map transformed by the convolution module; in the conversion branch, transforming the original feature map by using a preset conversion module to obtain a deep feature map transformed by the preset conversion module; inputting the deep feature map transformed by the convolution module and the deep feature map transformed by the preset conversion module into the adaptive region selection module of the corresponding branch respectively for adaptive processing to obtain a first weighted feature map and a second weighted feature map; obtaining an output result according to the first weighted feature map and the second weighted feature map; and calculating a loss function according to the output result and the real label, and updating network parameters by back-propagation using the loss function to obtain a trained crowd counting model.

In a second aspect of the embodiments of the present application, a crowd counting model training device based on an adaptive region selection module is provided, including: an extraction module configured to extract features of the object pictures in the training set by using a feature extraction network to obtain an original feature map; an input module configured to input the original feature map into a convolutional neural network branch and a conversion branch respectively; a first transformation module configured to transform the original feature map by using a convolution module in the convolutional neural network branch to obtain a deep feature map transformed by the convolution module; a second transformation module configured to transform the original feature map by using a preset conversion module in the conversion branch to obtain a deep feature map transformed by the preset conversion module; a weighting module configured to input the deep feature map transformed by the convolution module and the deep feature map transformed by the preset conversion module into the adaptive region selection module of the corresponding branch respectively for adaptive processing to obtain a first weighted feature map and a second weighted feature map; an output module configured to obtain an output result according to the first weighted feature map and the second weighted feature map; and an updating module configured to calculate a loss function according to the output result and the real label, and to update network parameters by back-propagation using the loss function to obtain a trained crowd counting model.

In a third aspect of the embodiments of the present application, there is provided an electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above method when executing the computer program.

In a fourth aspect of the embodiments of the present application, there is provided a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above method.

At least one of the above technical solutions adopted in the embodiments of the present application can achieve the following beneficial effects:

Features of the object pictures in the training set are extracted to obtain an original feature map, which is input into a convolutional neural network branch and a conversion branch respectively. In the convolutional neural network branch, the original feature map is transformed by a convolution module to obtain a deep feature map transformed by the convolution module; in the conversion branch, the original feature map is transformed by a preset conversion module to obtain a deep feature map transformed by the preset conversion module. The two deep feature maps are then sent to the adaptive region selection modules of their respective branches for adaptive processing, yielding the corresponding first weighted feature map and second weighted feature map. The first weighted feature map and the second weighted feature map are added to obtain the output result of crowd counting model training, a loss function is calculated from the output result and the real label, and the network parameters are updated by back-propagation according to the loss function so that the network better fits the training data. This training process is repeated until a crowd counting model with accurate output is obtained.
By designing a convolutional neural network branch and a conversion branch, local features are processed by the convolution module and global dependencies are established by the preset conversion module. An adaptive region selection module is arranged in each branch; it learns to determine the regions of interest and performs feature processing on them. That is, the adaptive region selection module of each branch adaptively processes its input, and the output of each branch is obtained from that branch's adaptive processing result. The outputs of the branches are then fused to obtain the output result of crowd counting model training, a loss function is calculated from the output result and the real label, and training is repeated according to the loss function to obtain a crowd counting model capable of outputting accurate crowd counting results.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and a person skilled in the art may obtain other drawings from them without inventive effort.

Fig. 1 is a flow chart of a crowd counting model training method based on an adaptive region selection module according to an embodiment of the present application;

Fig. 2 is a flow chart of another crowd counting model training method based on an adaptive region selection module according to an embodiment of the present application;

Fig. 3 is a flow chart of another crowd counting model training method based on an adaptive region selection module according to an embodiment of the present application;

Fig. 4 is a schematic structural diagram of a crowd counting model training device based on an adaptive region selection module according to an embodiment of the present application;

Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system configurations, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.

The terms first, second and the like in the description and in the claims, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged, as appropriate, such that embodiments of the present application may be implemented in sequences other than those illustrated or described herein, and that the objects identified by "first," "second," etc. are generally of a type and not limited to the number of objects, e.g., the first object may be one or more. Furthermore, in the description and claims, "and/or" means at least one of the connected objects, and the character "/", generally means that the associated object is an "or" relationship.

Furthermore, it should be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises an element.

A crowd counting model training method and device based on an adaptive region selection module according to an embodiment of the application will be described in detail below with reference to the accompanying drawings.

Fig. 1 is a flow chart of a crowd counting model training method based on an adaptive region selection module according to an embodiment of the present application. As shown in Fig. 1, the crowd counting model training method based on the adaptive region selection module includes:

S101, extracting features of the object pictures in the training set by using a feature extraction network to obtain an original feature map;

S102, inputting the original feature map into a convolutional neural network branch and a conversion branch respectively;

S103, in the convolutional neural network branch, transforming the original feature map by using a convolution module to obtain a deep feature map transformed by the convolution module;

S104, in the conversion branch, transforming the original feature map by using a preset conversion module to obtain a deep feature map transformed by the preset conversion module;

S105, inputting the deep feature map transformed by the convolution module and the deep feature map transformed by the preset conversion module into the adaptive region selection module of the corresponding branch respectively for adaptive processing to obtain a first weighted feature map and a second weighted feature map;

S106, obtaining an output result according to the first weighted feature map and the second weighted feature map;

S107, calculating a loss function according to the output result and the real label, and updating network parameters by back-propagation using the loss function to obtain a trained crowd counting model.
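The data flow of steps S101 to S107 can be sketched as follows. This is only an illustrative outline in NumPy: every function body is a hypothetical stand-in (average pooling for the feature extraction network, simple element-wise functions for the two branches), and the mean squared error loss is an assumption, since the text does not name a specific loss function.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_features(picture):
    # S101: stand-in for the feature extraction network (2x2 average pooling).
    h, w = picture.shape
    return picture.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def cnn_branch(features):
    # S103: stand-in for the convolution module.
    return np.maximum(features - features.mean(), 0.0)

def conversion_branch(features):
    # S104: stand-in for the preset conversion module.
    return np.tanh(features)

def region_select(deep):
    # S105: stand-in for an adaptive region selection module:
    # a sigmoid weight map multiplied element by element with its input.
    weights = 1.0 / (1.0 + np.exp(-deep))
    return weights * deep

def forward(picture):
    original = extract_features(picture)                 # S101
    first = region_select(cnn_branch(original))          # S102, S103, S105
    second = region_select(conversion_branch(original))  # S102, S104, S105
    return first + second                                # S106

picture = rng.random((8, 8))   # hypothetical object picture
label = rng.random((4, 4))     # hypothetical real label (density map)
output = forward(picture)
loss = float(np.mean((output - label) ** 2))  # S107 (MSE is an assumption)
```

In a real implementation the loss would be back-propagated through both branches by a deep learning framework; that step is omitted here.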

The real label can represent the real crowd density of the object picture. During training of the crowd counting model, the object picture can be a crowd image in the training set; after training is complete, the object picture can be a crowd image to be predicted.

In some embodiments, transforming the original feature map by using the convolution module to obtain the deep feature map transformed by the convolution module includes: performing convolution operations and pooling operations on the original feature map by using a plurality of continuously stacked residual modules in the convolution module to obtain the deep feature map transformed by the convolution module.

Specifically, in the convolutional neural network branch, the original feature map may be transformed by using a convolution module, that is, by continuously stacking a plurality of residual modules, a deep convolution module may be constructed, and a series of convolution operations and pooling operations may be performed on the original feature map by using the convolution module.

Each residual module can be composed of two convolution layers joined by a residual connection. Within each residual module, the first convolution layer extracts features, the features then undergo the nonlinear transformation of an activation function and pass through the second convolution layer, and finally the output of the two convolution layers is added to the module input through the residual connection to give the output of the single residual module. The original feature map is then passed through the successively stacked residual modules: each residual module takes the output of the previous residual module as its input, and the output of the last residual module is the deep feature map transformed by the convolution module. The convolution module may include a stack of 6 consecutive residual modules, but is not limited thereto.
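A minimal NumPy sketch of such a stack is given below, under several simplifying assumptions that are not stated in the text: a single channel, 3x3 kernels with zero padding, ReLU as the activation function, and randomly initialised kernels. The names `conv3x3`, `residual_module` and `convolution_module` are hypothetical.

```python
import numpy as np

def conv3x3(x, kernel):
    # 3x3 convolution (cross-correlation) with zero padding; size is preserved.
    h, w = x.shape
    padded = np.pad(x, 1)
    out = np.zeros_like(x)
    for di in range(3):
        for dj in range(3):
            out += kernel[di, dj] * padded[di:di + h, dj:dj + w]
    return out

def residual_module(x, k1, k2):
    y = conv3x3(x, k1)       # first convolution layer extracts features
    y = np.maximum(y, 0.0)   # activation function (ReLU is an assumption)
    y = conv3x3(y, k2)       # second convolution layer
    return x + y             # residual connection with the module input

def convolution_module(x, kernels):
    # Each residual module takes the output of the previous one as input.
    for k1, k2 in kernels:
        x = residual_module(x, k1, k2)
    return x

rng = np.random.default_rng(0)
kernels = [(rng.normal(0, 0.1, (3, 3)), rng.normal(0, 0.1, (3, 3)))
           for _ in range(6)]          # 6 stacked residual modules
original = rng.random((16, 16))        # stand-in original feature map
deep = convolution_module(original, kernels)
```

Because of the residual connection, a module whose kernels are all zero returns its input unchanged, which is one reason deep stacks of such modules remain trainable.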

Further, continuously stacking a plurality of residual modules can improve the performance of the convolution module in the crowd counting task. After the transformation of the convolution module, a corresponding deep feature map is obtained in which each position corresponds to a local area of the object picture. The deep feature map transformed by the convolution module is then input to the adaptive region selection module of the convolutional neural network branch to perform the crowd counting task.

In addition, a pooling operation can be included in the convolution module, which can reduce the size of the original feature map, thereby reducing network parameters and computation. The pooling operation may be inserted between residual modules or added after the residual modules, which is not limited. The specific pooling layer is not limited herein, but it should be noted that excessive pooling may result in information loss, so that the network structure design needs to perform the setting of pooling operation according to the requirements of specific tasks.
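As a small illustration of how pooling shrinks a feature map, the sketch below applies 2x2 max pooling with stride 2 in NumPy. The window size and stride are assumptions chosen for the example; the text leaves the pooling layer unspecified.

```python
import numpy as np

def max_pool2x2(x):
    # Split the map into non-overlapping 2x2 blocks and keep each block's maximum.
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

feature = np.arange(16, dtype=float).reshape(4, 4)
pooled = max_pool2x2(feature)   # 4x4 map reduced to 2x2
```

Each pooling step of this kind quarters the number of positions, which is why too many of them can discard the fine detail that dense crowds require.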

According to the method provided by the embodiment of the application, the original feature map is transformed by using the stacking of the plurality of residual modules in the convolution module in the convolution neural network branch, so that the depth of the network can be increased by continuously stacking the plurality of residual modules in the convolution module and corresponding convolution operation and pooling operation, and a deep feature map is obtained.

In some embodiments, transforming the original feature map by using the preset conversion module to obtain the deep feature map transformed by the preset conversion module includes: establishing a global dependency relationship of the original feature map by using a plurality of continuously stacked original Transformers in the preset conversion module to obtain the deep feature map transformed by the preset conversion module.

Specifically, in the conversion branch, after the original feature map is received, a global dependency relationship of the original feature map can be established by using a plurality of original Transformers continuously stacked in the preset conversion module, so as to obtain the deep feature map transformed by the preset conversion module.

The conversion branch may be a Transformer branch, the preset conversion module may be a Transformer module, and the original Transformer may be an original Transformer structure.

Therefore, in the embodiment of the application, in the Transformer branch, a global dependency relationship of the original feature map is established by continuously stacking a plurality of identical original Transformer structures, and a corresponding deep feature map is generated.

It should be appreciated that each original Transformer structure applies a self-attention mechanism to its input; that is, the original feature map is processed by self-attention in each of the sequentially stacked original Transformer structures. The global dependency of the original feature map is established by calculating a correlation score between each position and the other positions in the original feature map and fusing the features weighted by these scores, so that each position can sense the importance of the other positions. By stacking multiple original Transformer structures in succession, higher-level, more abstract feature representations can be extracted step by step. Each original Transformer structure can contain a self-attention mechanism and a feed-forward neural network; the feed-forward neural network may be applied after the self-attention mechanism to further process and combine features.

Each original Transformer structure takes the output of the previous layer as input and performs the self-attention and feed-forward neural network operations on it, until the deep feature map transformed by the preset conversion module is output. The number of original Transformer structures stacked in succession may be 6, but is not limited thereto.
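The self-attention and feed-forward operations described above can be sketched in NumPy as follows. This is a simplified illustration rather than the structure used in the application: it assumes single-head attention, omits layer normalisation and positional encoding, uses skip connections and a ReLU feed-forward network, and initialises all weights randomly. The names `transformer_layer` and `init_layer` are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def transformer_layer(tokens, p):
    # Self-attention: score each position against every other position.
    q, k, v = tokens @ p["wq"], tokens @ p["wk"], tokens @ p["wv"]
    scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (N, N) correlation scores
    tokens = tokens + scores @ v                       # weighted fusion of features
    # Feed-forward network applied after self-attention.
    hidden = np.maximum(tokens @ p["w1"], 0.0)
    return tokens + hidden @ p["w2"]

def init_layer(rng, c):
    shapes = {"wq": (c, c), "wk": (c, c), "wv": (c, c),
              "w1": (c, 2 * c), "w2": (2 * c, c)}
    return {name: rng.normal(0, 0.1, shape) for name, shape in shapes.items()}

rng = np.random.default_rng(0)
C, H, W = 8, 4, 4
layers = [init_layer(rng, C) for _ in range(6)]   # 6 stacked structures
feature_map = rng.random((C, H, W))               # stand-in original feature map

tokens = feature_map.reshape(C, H * W).T          # one token per spatial position
for p in layers:
    tokens = transformer_layer(tokens, p)         # output of one layer feeds the next
deep = tokens.T.reshape(C, H, W)                  # back to feature-map layout
```

Each row of `scores` sums to 1, so every output position is a weighted combination of all positions, which is what gives the branch its global view.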

According to the method, in the conversion branch, the original feature map is converted through the preset conversion module to establish global dependence of the original feature map, and the deep feature map converted through the preset conversion module is obtained, so that the network can be helped to learn abstract features of people in the object picture better, and accuracy of the crowd counting model is improved.

In some embodiments, the deep feature map transformed by the convolution module and the deep feature map transformed by the preset conversion module are respectively input into an adaptive region selection module of a corresponding branch, and adaptive processing is performed to obtain a first weighted feature map and a second weighted feature map, which includes: in the convolutional neural network branch, according to the self-adaptive region selection module of the convolutional neural network branch, carrying out self-adaptive processing on the deep feature map transformed by the convolutional module to obtain a first weighted feature map; and in the conversion branch, carrying out self-adaptive processing on the deep feature map transformed by the preset conversion module according to a self-adaptive region selection module of the conversion branch to obtain a second weighted feature map.

Specifically, the convolutional neural network branch may include a convolutional module and an adaptive region selection module, and the conversion branch may include a preset conversion module and an adaptive region selection module. After the original feature map is respectively input into a convolutional neural network branch and a conversion branch, in the convolutional neural network branch, the deep feature map transformed by the convolutional module can be subjected to self-adaptive processing according to a self-adaptive region selection module of the convolutional neural network branch, so as to obtain a first weighted feature map; in the conversion branch, the deep feature map converted by the preset conversion module can be subjected to self-adaptive processing according to the self-adaptive region selection module of the conversion branch, so as to obtain a second weighted feature map. The first weighted feature map and the second weighted feature map are output values of corresponding branches obtained after the corresponding branches process areas with different densities in the original feature map.

Further, after continued training, the convolutional neural network branch and the conversion branch can respectively process the sparse regions and the dense regions in the object picture. The first weighted feature map is therefore the feature map, output by the adaptive region selection module of the convolutional neural network branch, that mainly represents sparse regions, and the second weighted feature map is the feature map, output by the adaptive region selection module of the conversion branch, that mainly represents dense regions.

Further, there is no clear demarcation standard between sparse regions and dense regions. Instead, the attention mechanism in the corresponding adaptive region selection module learns autonomously to assign a weight to every feature in the deep feature map, giving higher weights to the sparse region or the dense region as appropriate, which yields the corresponding first weighted feature map and second weighted feature map. Finally, a loss function is calculated from the output result of the first and second weighted feature maps and the real label, and the network parameters are updated by back-propagation, thereby achieving the training effect.

According to the method, the original feature map is processed by the convolutional neural network branch to obtain the first weighted feature map and by the conversion branch to obtain the second weighted feature map. A loss function can then be calculated from the sum of the first and second weighted feature maps and the real label, and the network parameters updated by back-propagation, so that the network is trained and a trained crowd counting model is obtained. Because the different network branches adapt more flexibly to different types of crowd scenes, the accuracy of crowd counting is improved.

In some embodiments, performing an adaptive process to obtain a first weighted feature map and a second weighted feature map, including: in the self-adaptive region selection module of the corresponding branch, receiving the corresponding deep feature map and carrying out local maximum pooling treatment to obtain a treated feature map; inputting the processed feature map into a deconvolution layer, and carrying out deconvolution processing to obtain a deconvolution processed feature map; sequentially inputting the feature map subjected to deconvolution to a first convolution layer and a first activation function layer to obtain a first weight feature map; multiplying the feature map subjected to deconvolution with the first weight feature map element by element to obtain an enhanced feature map; sequentially inputting the enhanced feature images to a second convolution layer and a second activation function layer to obtain a second weight feature image; multiplying the second weight feature map with the deep feature map element by element to obtain a weighted feature map; the deep feature map comprises a deep feature map transformed by a convolution module and a deep feature map transformed by a preset conversion module, and the weighted feature map comprises a first weighted feature map and a second weighted feature map.

Specifically, the adaptive region selection module is trained, so that the adaptive region selection modules in different branches can respectively perform adaptive processing on regions with different densities.

Further, in the convolutional neural network branch and the conversion branch, the adaptive region selection module can dynamically adjust the attention degree of the sparse region and the dense region according to the characteristics of the sparse region and the dense region, so as to better process the crowd counting task, and as an example, the convolutional neural network branch can be utilized to mainly process the sparse region, namely, the adaptive region selection module in the convolutional neural network branch is utilized to pay attention to the sparse region. The dense region is mainly processed by the conversion branch, i.e. the dense region is focused by the adaptive region selection module in the conversion branch.

It should be appreciated that the adaptive processing flow of the adaptive region selection module in the convolutional neural network branch is the same as that in the conversion branch, so the embodiments of the present application describe the adaptive processing flow of the adaptive region selection modules in the two branches together.

Fig. 2 is a flow chart of another crowd counting model training method based on the adaptive region selection module according to an embodiment of the present application, and a specific embodiment of the present application is described below with reference to fig. 2.

Specifically, the adaptive region selection module receives the deep feature map of its branch and applies local maximum pooling to extract the most salient local features, yielding a processed feature map; this helps extract the key features of sparse and dense regions. The processed feature map is then deconvolved to enlarge the max-pooled feature map, yielding a deconvolved feature map that provides more detail for subsequent feature processing. The deconvolved feature map is then input to a convolution layer and a Sigmoid layer, namely the first convolution layer and the first activation function layer, to calculate weights, yielding a first weight feature map, which measures the importance of regions of different densities and thereby distinguishes sparse regions from dense regions. The first weight feature map is multiplied element by element with the deconvolved feature map to obtain an enhanced feature map that emphasizes the higher-weighted positions in the object picture. The enhanced feature map is then input to another convolution layer and Sigmoid layer, namely the second convolution layer and the second activation function layer, to obtain a second weight feature map, which further emphasizes the importance of regions of different densities; the second convolution layer also reduces the feature map size, restoring it to the size of the deep feature map.
Finally, the second weight feature map is multiplied element by element with the deep feature map received by the adaptive region selection module to obtain a weighted feature map. This weighted feature map corresponds to the first weighted feature map in the convolutional neural network branch and to the second weighted feature map in the conversion branch, and can be used to represent the feature weights of regions of different densities.
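The processing flow above can be sketched in NumPy for a single-channel feature map. The kernel sizes and strides are assumptions chosen so that the sizes match the description: 3x3 max pooling with stride 1 keeps the size, a 2x2 transposed convolution with stride 2 doubles it, a 1x1 first convolution keeps it, and a 2x2 second convolution with stride 2 restores the original size. All function names and parameter values are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def local_max_pool(x):
    # 3x3 local max pooling, stride 1, padded so the size is unchanged.
    h, w = x.shape
    padded = np.pad(x, 1, constant_values=-np.inf)
    windows = [padded[i:i + h, j:j + w] for i in range(3) for j in range(3)]
    return np.max(windows, axis=0)

def deconv2x(x, kernel):
    # Transposed convolution, 2x2 kernel, stride 2: each input value is
    # spread over a 2x2 output block, doubling height and width.
    return np.kron(x, kernel)

def conv_stride2(x, kernel, bias):
    # 2x2 convolution with stride 2: halves height and width.
    h, w = x.shape
    blocks = x.reshape(h // 2, 2, w // 2, 2)
    return np.einsum("iajb,ab->ij", blocks, kernel) + bias

def adaptive_region_select(deep, p):
    pooled = local_max_pool(deep)               # local maximum pooling
    up = deconv2x(pooled, p["deconv"])          # deconvolution enlarges the map
    w1 = sigmoid(p["w1"] * up + p["b1"])        # first conv (1x1) + Sigmoid
    enhanced = up * w1                          # enhanced feature map
    w2 = sigmoid(conv_stride2(enhanced, p["k2"], p["b2"]))  # second conv + Sigmoid
    return w2 * deep                            # weighted feature map

rng = np.random.default_rng(0)
params = {
    "deconv": rng.random((2, 2)),               # transposed-convolution kernel
    "w1": 1.5, "b1": -0.5,                      # 1x1 first convolution layer
    "k2": rng.normal(0, 1, (2, 2)), "b2": 0.0,  # second convolution layer
}
deep = rng.random((6, 6))                       # stand-in deep feature map
weighted = adaptive_region_select(deep, params)
```

Since the second weight feature map lies between 0 and 1, the module can only attenuate positions of the deep feature map, never amplify them; which positions are kept is what the branch learns during training.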

The structures of the first convolution layer and the second convolution layer can be consistent, and the two convolution operations are performed on the input to learn the abstract feature representation, but because the input of the first convolution layer and the input of the second convolution layer are different, the abstract feature representation of different layers can be learned, and the second convolution layer can restore the size of the input to the size before deconvolution processing.

The first activation function layer and the second activation function layer can both use the Sigmoid function. Applied after the convolution operation, the Sigmoid function introduces a nonlinear transformation, which increases the network's expressive capacity and nonlinear feature extraction capacity, and maps the input values of the activation layer to the range from 0 to 1. The output values of the Sigmoid function serve as the corresponding weight feature maps: the input of the first activation function layer is processed by the Sigmoid function to obtain the first weight feature map, and the input of the second activation function layer is processed by the Sigmoid function to obtain the second weight feature map. It should be appreciated that the input of the first activation function layer may be the output of the first convolution layer, and the input of the second activation function layer may be the output of the second convolution layer.

The first weighted feature map and the second weighted feature map can reflect how important the features at different positions of the deep feature map in the corresponding branch are to the final output result, and can be used to measure the importance of regions with different densities, thereby distinguishing sparse regions from dense regions. That is, if the convolutional neural network branch pays more attention to sparse regions and the conversion branch pays more attention to dense regions, the first weighted feature map obtained by the convolutional neural network branch can emphasize the features of sparse regions and give them higher weight; likewise, the second weighted feature map obtained by the conversion branch can emphasize the features of dense regions and give them higher weight. For the adaptive region selection modules in the different branches to process the features of regions with different densities accurately, they need to be trained repeatedly: the loss function is calculated from the output result and the real label, and the network parameters are then updated. When the output result of the crowd counting model is consistent with the real label, training of the crowd counting model can be regarded as complete, and training of the adaptive region selection modules is complete at the same time, so that the adaptive region selection modules can help the crowd counting model better perform the crowd counting task.

According to the method of this embodiment, the adaptive region selection module sequentially performs local maximum pooling, deconvolution, weight calculation, feature map enhancement, and other steps on the received deep feature map, thereby realizing adaptive processing of the received deep feature map.
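The sequence of steps above can be sketched in plain Python on a single-channel feature map. This is only an illustrative sketch under several assumptions that the application does not specify: a 2x2 pooling window with stride 2, a 2x2 transposed convolution with stride 2 (so the deconvolution output has the same size as the received deep feature map), and 1x1 convolutions reduced to a scalar weight and bias. All function and parameter names are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function written so that math.exp never overflows."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def max_pool_2x2(fmap):
    """Local max pooling, 2x2 window, stride 2 (height and width must be even)."""
    h, w = len(fmap), len(fmap[0])
    return [[max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, w, 2)]
            for i in range(0, h, 2)]


def deconv_2x2(fmap, kernel):
    """Transposed convolution, 2x2 kernel, stride 2: doubles height and width."""
    h, w = len(fmap), len(fmap[0])
    out = [[0.0] * (2 * w) for _ in range(2 * h)]
    for i in range(h):
        for j in range(w):
            for a in range(2):
                for b in range(2):
                    out[2 * i + a][2 * j + b] += fmap[i][j] * kernel[a][b]
    return out


def conv_1x1_sigmoid(fmap, weight, bias):
    """Assumed 1x1 convolution (scalar weight and bias) followed by Sigmoid."""
    return [[sigmoid(weight * v + bias) for v in row] for row in fmap]


def multiply(a, b):
    """Element-by-element product of two maps of equal shape."""
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def adaptive_region_selection(deep, kernel, w1, b1, w2, b2):
    pooled = max_pool_2x2(deep)                          # local maximum pooling
    up = deconv_2x2(pooled, kernel)                      # deconvolution
    first_weight = conv_1x1_sigmoid(up, w1, b1)          # first weight feature map
    enhanced = multiply(up, first_weight)                # enhanced feature map
    second_weight = conv_1x1_sigmoid(enhanced, w2, b2)   # second weight feature map
    return multiply(second_weight, deep)                 # weighted feature map


# Illustrative 4x4 deep feature map (assumed values).
deep = [[0.1, 0.4, 0.0, 0.2],
        [0.3, 0.2, 0.1, 0.0],
        [0.9, 0.7, 0.0, 0.0],
        [0.8, 1.0, 0.1, 0.0]]
ones_kernel = [[1.0, 1.0], [1.0, 1.0]]
weighted = adaptive_region_selection(deep, ones_kernel, 1.0, 0.0, 1.0, 0.0)
```

Because the second weight feature map lies between 0 and 1, every entry of `weighted` is a scaled-down copy of the corresponding entry of `deep`; in a trained module the learned convolution parameters would determine which regions keep most of their magnitude.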

In some implementations, obtaining the output result from the first weighted feature map and the second weighted feature map includes: and carrying out element-by-element addition according to the first weighted feature map and the second weighted feature map to obtain an output result.

Specifically, after the first weighted feature map is obtained through the self-adaptive region selection module of the convolutional neural network branch and the second weighted feature map is obtained through the self-adaptive region selection module of the conversion branch, the first weighted feature map and the second weighted feature map can be added element by element to obtain an output result of crowd counting.
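The element-by-element addition can be illustrated with a short Python sketch. The sample maps are made up for illustration, and reading a head count as the sum of the output map is an assumption borrowed from common density-map practice; the application itself only states that the addition yields the crowd counting output result.

```python
def add_elementwise(first, second):
    """Add two feature maps of identical shape, element by element."""
    same_rows = len(first) == len(second)
    if not same_rows or any(len(a) != len(b) for a, b in zip(first, second)):
        raise ValueError("weighted feature maps must have the same shape")
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(first, second)]


# Hypothetical weighted feature maps from the two branches.
first_weighted = [[0.5, 0.0], [0.25, 0.0]]   # convolutional branch (sparse regions)
second_weighted = [[0.0, 1.5], [0.0, 2.0]]   # conversion branch (dense regions)

output = add_elementwise(first_weighted, second_weighted)

# Assumption: the output is treated as a density map whose sum is the count.
estimated_count = sum(sum(row) for row in output)
```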

Further, the crowd counting model can be trained by using object pictures in the training set, namely, the loss function is calculated by using the output result and the real label, then the network parameters are updated by using a back propagation algorithm, and the characteristic representation of the areas with different densities can be better learned by continuously iterating the training process so as to train the crowd counting model.

Further, the network parameters can be updated mainly with regard to how well the adaptive region selection modules process regions of different densities. Through training of the adaptive region selection modules of both branches, the adaptive region selection module of the convolutional neural network branch learns to perform feature processing on sparse regions, and the adaptive region selection module of the conversion branch learns to perform feature processing on dense regions.

According to the method of this embodiment, the first weighted feature map and the second weighted feature map are added element by element to obtain an output result, the loss function is calculated from the output result and the real label, and the network parameters are reversely updated according to the loss function. Through this continuously iterated training process, the crowd counting model can gradually optimize the processing effect of the adaptive region selection modules, which improves the accuracy of the crowd counting model.
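The iterate, compute loss, update cycle can be illustrated with a deliberately tiny Python sketch. It is not the network of the application: each branch is reduced to a fixed feature vector scaled by one hypothetical trainable gain (`g1`, `g2`), the loss is assumed to be mean squared error (the application does not name a specific loss function), and the gradients are written out by hand instead of being produced by a back propagation framework.

```python
def mse(pred, target):
    """Mean squared error between two equally long lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def train(f1, f2, label, lr=0.1, steps=200):
    """Gradient descent on two hypothetical per-branch gains g1 and g2."""
    g1, g2 = 0.0, 0.0
    n = len(label)
    history = []
    for _ in range(steps):
        # Forward pass: element-by-element sum of the two scaled branch outputs.
        pred = [g1 * a + g2 * b for a, b in zip(f1, f2)]
        history.append(mse(pred, label))
        # Analytic gradients of the MSE loss with respect to g1 and g2.
        grad1 = sum(2 * (p - t) * a for p, t, a in zip(pred, label, f1)) / n
        grad2 = sum(2 * (p - t) * b for p, t, b in zip(pred, label, f2)) / n
        # Reverse update of the parameters.
        g1 -= lr * grad1
        g2 -= lr * grad2
    return g1, g2, history


# Toy data: branch 1 responds at sparse positions, branch 2 at dense positions.
f1 = [1.0, 0.0, 1.0, 0.0]
f2 = [0.0, 1.0, 0.0, 1.0]
label = [0.5, 2.0, 0.5, 2.0]

g1, g2, history = train(f1, f2, label)
```

In this toy setting the loss recorded in `history` shrinks at every step, and the two gains settle near the values that reproduce the label, which mirrors how repeated updates let each branch specialize.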

In some embodiments, extracting features of the object pictures in the training set by using a feature extraction network to obtain an original feature map includes: extracting the features of the object picture by using a preset deep convolutional neural network model, with the backbone network of a pre-trained target detection network used as the initialization parameters, to obtain the original feature map.

Specifically, a feature extraction network may be used to extract features in the input object picture, where a backbone network of the pre-trained target detection network may be used as an initialization parameter to extract features of the object picture, and obtain an original feature map.

Furthermore, ResNet50 may be used as the feature extraction network, with its parameters initialized from the backbone network of the pre-trained target detection network, so that convergence of the crowd counting model can be accelerated and its effect improved.

FIG. 3 is a flow chart of yet another training method for crowd count models based on an adaptive region selection module according to an embodiment of the present application, and the following further describes the embodiment of the present application with reference to FIG. 3:

In the model training process, an object picture in the training set is first input.

The object pictures in the training set are sent to the feature extraction network to obtain an original feature map, and the original feature map is then input to the convolutional neural network branch and the conversion branch respectively.

The convolutional neural network branch comprises a convolution module and an adaptive region selection module. The conversion branch is a converter (Transformer) branch, which comprises a converter module and an adaptive region selection module.

In the convolutional neural network branch, the original feature map undergoes a series of convolution and pooling operations in the convolution module to obtain a deep feature map transformed by the convolution module, and this deep feature map is then processed by the adaptive region selection module to obtain a first weighted feature map representing sparse regions.

In the conversion branch, the original feature map is first transformed by the conversion module, which establishes the global dependency relationship of the original feature map, to obtain a deep feature map transformed by the conversion module; this deep feature map is then processed by the adaptive region selection module to obtain a second weighted feature map representing dense regions.

The first weighted feature map and the second weighted feature map are then added element by element to obtain the crowd counting output result.

A loss function is then calculated from the output result and the real label, and the network parameters are reversely updated according to the loss function. It should be understood that the loss calculation and the reverse update of the network are performed only during training of the crowd counting model; once training is complete, the crowd counting result can be output directly by the crowd counting model.

In this way, the adaptive region selection modules can be used to process the features of regions with different densities, training of the crowd counting model is gradually completed, and a crowd counting model capable of handling complex scenes is obtained, which improves the prediction effect of the crowd counting task in complex scenes. It should further be appreciated that, although the embodiments of the present application use object pictures as the training set or the prediction object, the crowd counting model may also perform the crowd counting task on video: crowd counting may be performed on each frame of the video, and specific task processing may then be performed according to the crowd counting results of consecutive frames. This is not elaborated here and is not limited.

Any combination of the above optional solutions may be adopted to form an optional embodiment of the present application, which is not described herein in detail.

The following are device embodiments of the present application, which may be used to perform method embodiments of the present application. For details not disclosed in the device embodiments of the present application, please refer to the method embodiments of the present application.

Fig. 4 is a schematic structural diagram of a crowd counting model training device based on an adaptive region selection module according to an embodiment of the present application. As shown in fig. 4, the crowd counting model training device based on the adaptive region selection module includes:

The extracting module 401 is configured to extract the features of the object pictures in the training set by using the feature extracting network to obtain an original feature map;

an input module 402 configured to input the raw feature map to a convolutional neural network branch and a conversion branch, respectively;

the first transformation module 403 is configured to transform the original feature map by using the convolution module in the branches of the convolution neural network to obtain a deep feature map transformed by the convolution module;

the second transformation module 404 is configured to transform the original feature map by using a preset transformation module in the transformation branch, so as to obtain a deep feature map transformed by the preset transformation module;

the weighting module 405 is configured to input the deep feature map transformed by the convolution module and the deep feature map transformed by the preset conversion module into the adaptive region selection module of the corresponding branch, and perform adaptive processing to obtain a first weighted feature map and a second weighted feature map;

an output module 406 configured to obtain an output result according to the first weighted feature map and the second weighted feature map;

the updating module 407 is configured to calculate a loss function according to the output result and the real label, and reversely update the network parameters by using the loss function to obtain a trained crowd counting model.
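The data flow between modules 401 to 407 can be sketched as a small Python class in which each module is a plain callable. This is only a structural sketch: the class name `CrowdCountingTrainer`, its field names, the flat-list representation of feature maps, and the stand-in lambdas are all hypothetical, and the parameter update performed by the updating module is left out.

```python
from dataclasses import dataclass
from typing import Callable, List

FeatureMap = List[float]
Transform = Callable[[FeatureMap], FeatureMap]


@dataclass
class CrowdCountingTrainer:
    extract: Transform               # extraction module 401
    conv_transform: Transform        # first transformation module 403
    conversion_transform: Transform  # second transformation module 404
    conv_ars: Transform              # weighting module 405, convolutional branch
    conversion_ars: Transform        # weighting module 405, conversion branch
    loss_fn: Callable[[FeatureMap, FeatureMap], float]  # used by updating module 407

    def forward(self, picture: FeatureMap) -> FeatureMap:
        original = self.extract(picture)
        # Input module 402: the same original feature map feeds both branches.
        deep_conv = self.conv_transform(original)
        deep_conversion = self.conversion_transform(original)
        first_weighted = self.conv_ars(deep_conv)
        second_weighted = self.conversion_ars(deep_conversion)
        # Output module 406: element-by-element addition.
        return [a + b for a, b in zip(first_weighted, second_weighted)]

    def loss(self, picture: FeatureMap, label: FeatureMap) -> float:
        # Updating module 407, loss part only; the reverse update is omitted.
        return self.loss_fn(self.forward(picture), label)


def mse(pred: FeatureMap, target: FeatureMap) -> float:
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


# Stand-in callables, chosen only to make the wiring visible.
trainer = CrowdCountingTrainer(
    extract=lambda x: list(x),
    conv_transform=lambda x: [2.0 * v for v in x],
    conversion_transform=lambda x: [v + 1.0 for v in x],
    conv_ars=lambda x: list(x),
    conversion_ars=lambda x: list(x),
    loss_fn=mse,
)
result = trainer.forward([1.0, 2.0])
```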

In some embodiments, the first transformation module 403 is specifically configured to perform convolution operation and pooling operation on the original feature map by using a plurality of residual modules stacked in succession in the convolution module, so as to obtain a deep feature map transformed by the convolution module.

In some embodiments, the second transformation module 404 is specifically configured to establish a global dependency relationship of the original feature map according to a plurality of original converters stacked in succession in the preset transformation module, so as to obtain a deep feature map transformed by the preset transformation module.

In some embodiments, the weighting module 405 is specifically configured to perform, in the convolutional neural network branch, adaptive processing on the deep feature map transformed by the convolutional module according to the adaptive region selection module of the convolutional neural network branch, to obtain a first weighted feature map; and in the conversion branch, carrying out self-adaptive processing on the deep feature map transformed by the preset conversion module according to a self-adaptive region selection module of the conversion branch to obtain a second weighted feature map.

In some embodiments, the weighting module 405 is specifically configured to receive the corresponding deep feature map and perform local maximum pooling processing in the adaptive region selection module of the corresponding branch, so as to obtain a processed feature map; inputting the processed feature map into a deconvolution layer, and carrying out deconvolution processing to obtain a deconvolution processed feature map; sequentially inputting the feature map subjected to deconvolution to a first convolution layer and a first activation function layer to obtain a first weight feature map; multiplying the feature map subjected to deconvolution with the first weight feature map element by element to obtain an enhanced feature map; sequentially inputting the enhanced feature images to a second convolution layer and a second activation function layer to obtain a second weight feature image; multiplying the second weight feature map with the deep feature map element by element to obtain a weighted feature map; the deep feature map comprises a deep feature map transformed by a convolution module and a deep feature map transformed by a preset conversion module, and the weighted feature map comprises a first weighted feature map and a second weighted feature map.

In some embodiments, the output module 406 is specifically configured to perform element-by-element addition according to the first weighted feature map and the second weighted feature map, so as to obtain an output result.

In some embodiments, the extracting module 401 is specifically configured to extract the features of the object picture by using a preset deep convolutional neural network model according to the backbone network of the pre-trained target detection network as an initialization parameter, so as to obtain an original feature map.

It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution. The execution order of each process should be determined by its function and internal logic, and should not limit the implementation process of the embodiments of the present application in any way.

Fig. 5 is a schematic diagram of an electronic device 5 provided in an embodiment of the present application. As shown in fig. 5, the electronic apparatus 5 of this embodiment includes: a processor 501, a memory 502 and a computer program 503 stored in the memory 502 and executable on the processor 501. The steps of the various method embodiments described above are implemented by processor 501 when executing computer program 503. Alternatively, the processor 501, when executing the computer program 503, performs the functions of the modules/units in the above-described apparatus embodiments.

The electronic device 5 may be a desktop computer, a notebook computer, a palmtop computer, a cloud server, or the like. The electronic device 5 may include, but is not limited to, a processor 501 and a memory 502. It will be appreciated by those skilled in the art that fig. 5 is merely an example of the electronic device 5 and does not limit it; the electronic device 5 may include more or fewer components than shown, or different components.

The processor 501 may be a central processing unit (Central Processing Unit, CPU) or other general purpose processor, digital signal processor (Digital Signal Processor, DSP), application specific integrated circuit (Application Specific Integrated Circuit, ASIC), field programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like.

The memory 502 may be an internal storage unit of the electronic device 5, for example, a hard disk or a memory of the electronic device 5. The memory 502 may also be an external storage device of the electronic device 5, for example, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash Card (Flash Card) or the like, which are provided on the electronic device 5. Memory 502 may also include both internal storage units and external storage devices of electronic device 5. The memory 502 is used to store computer programs and other programs and data required by the electronic device.

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit.

The integrated modules/units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow in the methods of the above embodiments may also be implemented by a computer program instructing related hardware; the computer program may be stored in a computer readable storage medium and, when executed by a processor, may implement the steps of the respective method embodiments described above. The computer program may comprise computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer readable medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth.

The above embodiments are only for illustrating the technical solution of the present application, and are not limiting thereof; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.

Claims

1. The crowd counting model training method based on the adaptive region selection module is characterized by comprising the following steps of:

extracting the characteristics of the object pictures in the training set by utilizing a characteristic extraction network to obtain an original characteristic picture;

inputting the original feature map to a convolutional neural network branch and a conversion branch respectively;

in the branches of the convolutional neural network, the original feature map is transformed by a convolutional module, and a deep feature map transformed by the convolutional module is obtained;

in the conversion branch, the original feature map is converted by a preset conversion module, so that a deep feature map converted by the preset conversion module is obtained;

Respectively inputting the deep feature map transformed by the convolution module and the deep feature map transformed by the preset conversion module into an adaptive region selection module of a corresponding branch for adaptive processing to obtain a first weighted feature map and a second weighted feature map;

obtaining an output result according to the first weighted feature map and the second weighted feature map;

and calculating a loss function according to the output result and the real label, and reversely updating network parameters by using the loss function to obtain a trained crowd counting model.

2. The method of claim 1, wherein transforming the original feature map with a convolution module to obtain a deep feature map transformed with the convolution module comprises:

and carrying out convolution operation and pooling operation on the original feature map by utilizing a plurality of residual modules which are continuously stacked in the convolution module to obtain the deep feature map transformed by the convolution module.

3. The method according to claim 1, wherein transforming the original feature map by using a preset transformation module to obtain a deep feature map transformed by the preset transformation module comprises:

And establishing a global dependency relationship of the original feature map according to a plurality of original converters which are continuously stacked in the preset conversion module to obtain the deep feature map transformed by the preset conversion module.

4. The method of claim 1, wherein the inputting the deep feature map transformed by the convolution module and the deep feature map transformed by the preset conversion module into the adaptive region selection module of the corresponding branch respectively, and performing adaptive processing to obtain a first weighted feature map and a second weighted feature map, includes:

in the convolutional neural network branch, according to an adaptive region selection module of the convolutional neural network branch, performing adaptive processing on the deep feature map transformed by the convolutional module to obtain the first weighted feature map;

and in the conversion branch, carrying out self-adaptive processing on the deep feature map transformed by the preset conversion module according to a self-adaptive region selection module of the conversion branch to obtain the second weighted feature map.

5. The method of claim 4, wherein the performing the adaptive processing to obtain the first weighted feature map and the second weighted feature map comprises:

In the self-adaptive region selection module of the corresponding branch, receiving the corresponding deep feature map and carrying out local maximum pooling treatment to obtain a treated feature map;

inputting the processed feature map to a deconvolution layer, and carrying out deconvolution processing to obtain a deconvolution processed feature map;

sequentially inputting the feature map subjected to deconvolution to a first convolution layer and a first activation function layer to obtain a first weight feature map;

multiplying the feature map subjected to deconvolution processing by the first weight feature map element by element to obtain an enhanced feature map;

sequentially inputting the enhanced feature images to a second convolution layer and a second activation function layer to obtain a second weight feature image;

multiplying the second weight feature map with the deep feature map element by element to obtain a weighted feature map;

the deep feature map comprises the deep feature map transformed by the convolution module and the deep feature map transformed by the preset conversion module, and the weighted feature map comprises a first weighted feature map and a second weighted feature map.

6. The method of claim 1, wherein the obtaining an output result from the first weighted feature map and the second weighted feature map comprises:

And carrying out element-by-element addition according to the first weighted feature map and the second weighted feature map to obtain the output result.

7. The method according to claim 1, wherein extracting features of the object pictures in the training set by using the feature extraction network to obtain the original feature map comprises:

and extracting the features of the object picture by using a preset deep convolutional neural network model, with the backbone network of a pre-trained target detection network used as the initialization parameters, to obtain the original feature map.

8. A crowd counting model training device based on an adaptive region selection module, characterized by comprising:

the extraction module is configured to extract the characteristics of the object pictures in the training set by utilizing the characteristic extraction network to obtain an original characteristic diagram;

the input module is configured to input the original feature map to a convolutional neural network branch and a conversion branch respectively;

the first transformation module is configured to transform the original feature map by using the convolution module in the convolution neural network branch to obtain a deep feature map transformed by the convolution module;

the second transformation module is configured to transform the original feature map by using a preset transformation module in the transformation branch to obtain a deep feature map transformed by the preset transformation module;

The weighting module is configured to input the deep feature map transformed by the convolution module and the deep feature map transformed by the preset conversion module into the adaptive region selection module of the corresponding branch respectively, and perform adaptive processing to obtain a first weighted feature map and a second weighted feature map;

the output module is configured to obtain an output result according to the first weighted feature map and the second weighted feature map;

and the updating module is configured to calculate a loss function according to the output result and the real label, and reversely update network parameters by using the loss function to obtain a trained crowd counting model.

9. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 7 when the computer program is executed.

10. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any one of claims 1 to 7.