CN111783935A - Convolutional neural network construction method, device, equipment and medium - Google Patents

Convolutional neural network construction method, device, equipment and medium

Info

Publication number
CN111783935A
CN111783935A
Authority
CN
China
Prior art keywords
module
convolution
neural network
input
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010414618.8A
Other languages
Chinese (zh)
Inventor
夏春龙 (Xia Chunlong)
Current Assignee
Beijing Megvii Technology Co Ltd
Original Assignee
Beijing Megvii Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Megvii Technology Co Ltd filed Critical Beijing Megvii Technology Co Ltd
Priority to CN202010414618.8A
Publication of CN111783935A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/25: Fusion techniques
    • G06F 18/253: Fusion techniques of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/047: Probabilistic or stochastic networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods

Abstract

The embodiment of the invention provides a method, a device, equipment and a medium for constructing a convolutional neural network. The method comprises the following steps: determining the output end of a convolution module in an original convolutional neural network, wherein the convolution module comprises a plurality of convolution layers, a direct connection branch is arranged between the input end and the output end of the convolution module, and the convolution module and the direct connection branch share the same input; and adding a global attention module between the output end of the convolution module and the output end of the direct connection branch to obtain a target convolutional neural network, wherein the global attention module is used for outputting a global attention feature map, and the sum of the output of the global attention module and the output of the direct connection branch serves as the input of the next convolution module.

Description

Convolutional neural network construction method, device, equipment and medium
Technical Field
The invention relates to the technical field of deep learning, and in particular to a method, a device, equipment and a medium for constructing a convolutional neural network.
Background
Image recognition is a basic task in the field of computer vision: identifying or verifying the identity, attributes or class of a target subject in an image. Existing image recognition methods are mainly learnable-feature methods represented by neural networks, which are widely applied to image recognition tasks because of their strong, self-adaptive feature expression capability that requires no fine-grained manual design.
In practice, in order to improve recognition efficiency, a convolutional neural network is generally used for image recognition. Convolutional neural networks include a variety of framework models, such as ResNet, ResNeXt, MobileNet, ShuffleNet, VGG and GoogLeNet. These models enlarge the receptive field through layer-by-layer accumulation and save computing power and storage resources, but global information is lost and no efficient attention mechanism is available, so the information extracted during recognition is not necessarily effective key information.
The related art proposes a local convolutional neural network in which different convolution kernels are used for different areas of the feature map. This improves the precision of the convolutional neural network to a certain extent and captures the relations among features at different spatial positions, but the size of each area cannot be determined, the feature map must undergo frequent cutting and splicing operations, the amount of calculation is large, and the efficiency is low. In summary, the convolutional neural networks provided in the related art have an attention mechanism that is not efficient enough.
Disclosure of Invention
In view of the above, a convolutional neural network construction method, apparatus, device and medium according to embodiments of the present invention are proposed to overcome, or at least partially solve, the above problems.
In order to solve the above problem, a first aspect of the present invention discloses a convolutional neural network construction method, including:
determining the output end of a convolution module in an original convolutional neural network, wherein the convolution module comprises a plurality of convolution layers, a direct connection branch is arranged between the input end and the output end of the convolution module, and the convolution module and the direct connection branch share the same input;
and adding a global attention module between the output end of the convolution module and the output end of the direct connection branch to obtain a target convolutional neural network, wherein the global attention module is used for outputting a global attention feature map, and the sum of the output of the global attention module and the output of the direct connection branch serves as the input of the next convolution module.
Optionally, the global attention module comprises: a weight value generation submodule, a combination submodule and a feature map generation submodule;
the weight generation submodule is used for processing the feature map input into the global attention module in the spatial position dimension and the channel dimension to generate weights of a plurality of channels and weights of a plurality of spatial positions;
the joint submodule is used for processing the weights of the channels and the weights of the spatial positions and outputting a global attention weight;
and the feature map generation submodule is used for processing the feature map input into the global attention module according to the global attention weight value to generate the global attention feature map.
Optionally, the weight value generation sub-module includes: a channel attention unit and a spatial attention unit;
the channel attention unit is used for processing the feature map input into the global attention module from a channel dimension so as to output weights of a plurality of channels;
the spatial attention unit is used for processing the feature map input into the global attention module from the spatial position dimension to output a plurality of spatial position weights.
Optionally, the channel attention unit comprises: the device comprises a first adjusting subunit, a pooling subunit, a second adjusting subunit and a weight value generating subunit;
the first adjusting subunit is configured to process the feature map input to the global attention module to obtain a first tensor;
the pooling subunit is configured to perform pooling processing on the first tensor to obtain a second tensor;
the second adjustment subunit is configured to adjust the second tensor to obtain a third tensor;
and the weight generation subunit is used for processing the third tensor to generate a channel weight tensor.
Optionally, the spatial attention unit comprises: the pooling subunit, the third adjusting subunit and the weight generating subunit;
the pooling subunit is configured to process the feature map input to the global attention module to obtain a fourth tensor;
the third adjusting subunit is configured to adjust the fourth tensor to generate an adjusted fifth tensor;
and the weight generation subunit is used for processing the fifth tensor to generate a spatial position weight tensor.
Optionally, the pooling sub-unit is a convolution unit with a preset convolution size; the weight generation subunit includes: a full connection layer and a Sigmoid function layer connected in sequence.
Optionally, the method further comprises: and training the target convolutional neural network by taking the sample image set as a training sample to obtain an image recognition model for image recognition.
Optionally, training the target convolutional neural network by using a sample image set as a training sample to obtain an image recognition model for performing image recognition, including:
training the target convolutional neural network by taking a sample image set as a training sample;
in the training process, obtaining a plurality of candidate image recognition models which are trained for different times;
and screening a model meeting a preset test condition from the candidate image recognition models to obtain an image recognition model for image recognition.
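The screening step above can be sketched as a simple checkpoint-selection routine. This is only an illustrative sketch with assumed names: the patent does not specify the preset test condition, so "highest validation accuracy" is used here as a stand-in.

```python
# Hypothetical sketch: pick, from candidate models saved at different
# training stages, the one that best meets a preset test condition
# (assumed here to be highest validation accuracy).
def select_model(candidates):
    """candidates: list of (model_id, validation_accuracy) pairs."""
    return max(candidates, key=lambda pair: pair[1])

checkpoints = [("epoch_10", 0.81), ("epoch_20", 0.86), ("epoch_30", 0.84)]
best = select_model(checkpoints)
print(best)  # ('epoch_20', 0.86)
```

In practice the condition could equally be a loss threshold or a test-set metric; the selection pattern stays the same.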
Optionally, after obtaining an image recognition model for image recognition, the method includes:
obtaining an image to be identified;
extracting the features of the image to be recognized to obtain a feature map of the image to be recognized;
and inputting the characteristic diagram of the image to be recognized into an image recognition model to obtain an image recognition result.
A second aspect of the invention discloses a convolutional neural network construction apparatus, the apparatus comprising:
the determining module is used for determining the output end of a convolution module in an original convolutional neural network, wherein the convolution module comprises a plurality of convolution layers, a direct connection branch is arranged between the input end and the output end of the convolution module, and the convolution module and the direct connection branch share the same input;
and the adding module is used for adding a global attention module between the output end of the convolution module and the output end of the direct connection branch to obtain a target convolution neural network, wherein the global attention module is used for outputting a global attention feature map, and the sum of the output of the global attention module and the output of the direct connection branch is the input of the next convolution module.
In a third aspect of the embodiments of the present invention, an electronic device is further disclosed, including:
one or more processors; and
one or more machine readable media having instructions stored thereon which, when executed by the one or more processors, cause the apparatus to perform a convolutional neural network construction method as described in embodiments of the first aspect of the invention.
In a fourth aspect of the embodiments of the present invention, a computer-readable storage medium is further disclosed, which stores a computer program for causing a processor to execute the convolutional neural network construction method according to the embodiments of the first aspect of the present invention.
The embodiment of the invention has the following advantages:
in the embodiment of the invention, a global attention module is added between the output end of a convolution module and the output end of a direct connection branch in the original convolutional neural network. The global attention module outputs a global attention feature map, and the sum of the global attention feature map and the output of the direct connection branch serves as the input of the next convolution module, so that a target convolutional neural network is constructed that can be used for processing images.
The global attention feature map can reflect the global importance of each feature map output by the convolution module, such as its importance across channels and spatial positions, so the resulting target convolutional neural network can dynamically learn the importance of different channels and different spatial positions. Frequent cutting and splicing operations on feature maps are avoided, the amount of calculation is reduced, and the attention mechanism is optimized. Moreover, because the sum of the global attention feature map and the output of the direct connection branch is used as the input of the next convolution module, more precise global information is extracted overall, which improves the precision of image recognition.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description are only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without inventive labor.
FIG. 1 is a schematic diagram of a raw convolutional neural network according to an embodiment of the present invention;
FIG. 2 is a flow chart of steps of a convolutional neural network construction method according to an embodiment of the present invention;
FIG. 3 is a block diagram of a global attention module according to an embodiment of the invention;
FIG. 4 is a schematic spatial diagram of a feature map when N = 1 in (N, C, H, W) according to an embodiment of the present invention;
FIG. 5 is a block diagram of another global attention module according to an embodiment of the invention;
FIG. 6 is a block diagram of the global attention module in an example of the present invention;
FIG. 7 is a schematic diagram showing the structure of the global attention module shown in FIG. 3 or FIG. 5 after being added to the original convolutional neural network shown in FIG. 1;
FIG. 8 is a block diagram of a convolutional neural network constructing apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments are described in detail below with reference to the accompanying figures, so as to describe the technical solutions in the embodiments of the present invention clearly and completely. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
In view of the problems that the various convolutional neural network models in the related art require frequent cutting and splicing operations on the feature map, involve a large amount of calculation and are inefficient, the applicant proposes a convolutional neural network construction method. Its main idea is to add a global attention module at the output end of each convolution module in an original convolutional neural network model, so as to learn the importance of different channels and different spatial positions independently, thereby optimizing the attention mechanism and improving image recognition efficiency.
The following describes the convolutional neural network construction method according to the present invention in detail. It should be noted that the convolutional neural network construction method provided by the invention can be applied to a terminal device or a server.
Referring to fig. 1, a network structure diagram of the original neural network to be processed in this embodiment is shown. The original neural network shown in fig. 1 is ResNet18, which includes a full connection layer and a plurality of convolution modules.
A convolutional neural network construction method according to the present embodiment is described with reference to the convolutional neural network shown in fig. 1.
Referring to fig. 2, a flowchart illustrating steps of a convolutional neural network construction method according to this embodiment is shown, and as shown in fig. 2, the method may specifically include the following steps:
step S201: and determining the output end of the convolution module in the original convolution neural network.
The convolution module comprises a plurality of convolution layers, a direct connection branch is arranged between the input end and the output end of the convolution module, and the input of the convolution module is shared with the input of the direct connection branch.
Generally, the original convolutional neural network may include a plurality of convolution modules, and may further include a pooling layer, a full connection layer and the like. The pooling layer retains the main features while reducing the number of parameters and the amount of calculation, and the full connection layer extracts and integrates the obtained feature information. Each convolution module performs convolution processing on the feature map output by the previous convolution module to obtain local features, and then outputs the convolved feature map to the next convolution module.
The direct connection branch between the input end and the output end of the convolution module can be understood as outputting the feature map input to the convolution module either directly or after downsampling; the sum of the output of the direct connection branch and the output of the convolution module then serves as the input of the next convolution module.
As also shown in fig. 1, the dashed box marks a convolution module, and it can be seen that ResNet18 includes 8 convolution modules in total. Each convolution module may comprise two convolution layers and has a direct connection branch between its input end and output end; the direct connection branch outputs the feature map input to the convolution module either directly or after downsampling.
In this embodiment, the output ends of all convolution modules included in the original convolutional neural network, which are the ends of the convolution modules used for outputting the feature map, may be determined.
Step S202: and adding a global attention module between the output end of the convolution module and the output end of the direct connection branch to obtain a target convolution neural network.
The global attention module is used for outputting a global attention feature map, and the sum of the output of the global attention module and the output of the direct connection branch is the input of the next convolution module.
In this embodiment, a global attention module may be added between the output end of each convolution module and the output end of the direct connection branch corresponding to the convolution module, and a sum of the feature map output by the global attention module and the feature map output by the direct connection branch may be used as an input of a next convolution module.
The global attention module is connected to the output end of the convolution module, so the feature map output by the convolution module is input into the global attention module, which produces the global attention feature map. Specifically, the global attention module redistributes attention over the channels and spatial positions of the input feature map, that is, it adjusts the importance of the feature map across channels and spatial positions to output a global attention feature map that reflects more comprehensive global feature information.
As shown in fig. 1, the portion enclosed by the dashed box is a convolution module 101, and the global attention module according to the embodiment of the present invention may be connected between the output of the convolution module 101 and the output of the direct connection branch 102. Specifically, the global attention module may be located at the position indicated by the dashed arrow 103 in fig. 1: the output of the convolution module 101 is input directly to the global attention module, and the sum of the output of the global attention module and the output of the direct connection branch serves as the input of the next convolution module 104.
In practice, a global attention module may be added between the output of each convolution module in the original convolutional neural network and the output of the corresponding direct-connected branch.
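The wiring described above can be sketched as follows. This is a minimal illustration with assumed helper names (conv_module, global_attention and shortcut stand in for the learned layers); it only shows that the next convolution module receives the sum of the attended convolution output and the direct connection branch output.

```python
import numpy as np

def block_with_global_attention(x, conv_module, global_attention, shortcut):
    # The output of the convolution module passes through the global
    # attention module; the direct connection branch bypasses both;
    # their sum feeds the next convolution module.
    return global_attention(conv_module(x)) + shortcut(x)

# Toy stand-ins just to show the data flow (real layers are learned).
x = np.ones((1, 4, 8, 8))
out = block_with_global_attention(
    x,
    conv_module=lambda t: 2.0 * t,       # pretend stacked conv layers
    global_attention=lambda t: 0.5 * t,  # pretend attention reweighting
    shortcut=lambda t: t,                # identity direct connection
)
print(out.shape)  # (1, 4, 8, 8)
```

Because the attention module sits before the shortcut sum, the residual path is untouched, which is what lets the module be inserted into every block of the original network without changing tensor shapes.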
Referring to fig. 3, a schematic structural diagram of a global attention module in an embodiment is shown, and as shown in fig. 3, the global attention module 300 may include: the weight generation submodule, the combination submodule and the feature map generation submodule.
The sub-modules of the global attention module 300 according to this embodiment are described in detail below with reference to fig. 3:
the weight generation submodule can be used for processing the feature map input into the global attention module from the spatial position dimension and the channel dimension to generate weights of a plurality of channels and weights of a plurality of spatial positions;
the joint submodule can be used for processing the weights of the channels and the weights of the spatial positions and outputting a global attention weight;
and the feature map generation submodule is used for processing the feature map input into the global attention module according to the global attention weight value to generate the global attention feature map.
In this embodiment, the weight value generation sub-module may generate a corresponding weight value for each channel in the input feature map, and generate a corresponding weight value for different spatial positions, where the spatial position may be understood as a local region of the feature map in the spatial dimension. In this embodiment, the weight corresponding to each channel may be used to represent the importance of the feature map output by the convolution module in different channels, and the weights corresponding to different spatial positions may be used to represent the importance of the feature map output by the convolution module in different spatial positions.
In a specific implementation, when generating weights corresponding to a plurality of channels, the influence of the spatial position may be eliminated first, and when generating weights of a plurality of spatial positions, the influence of the plurality of channels may be eliminated first.
As shown in fig. 3, the feature map input to the global attention module is (N, C, H, W), where N denotes the number of images input to the original convolutional neural network, C denotes the number of channels, H denotes the height of the feature map, and W denotes the width of the feature map.
Referring to fig. 4, a spatial diagram of the feature map when N = 1 in (N, C, H, W) is shown. As shown in fig. 4, (H, W) can be understood as the spatial size of the feature map, characterizing the distribution of features in space; for example, one cell of the C1 channel in fig. 4 represents a spatial position, and C1 to Cn are the n channels.
The weight generation submodule outputs the weights (N, C, 1, 1) of a plurality of channels and the weights (N, 1, H, W) of a plurality of spatial positions, where (N, C, 1, 1) may also be called the channel weight tensor and (N, 1, H, W) the spatial position weight tensor. (N, C, 1, 1) can be understood as the weight of each channel in the feature map (N, C, H, W), and (N, 1, H, W) as the weight tensor corresponding to the different spatial positions in (N, C, H, W).
After the respective weights of the multiple channels and the respective weights of the multiple spatial positions are obtained, the respective weights of the multiple channels and the respective weights of the multiple spatial positions can be input to a combining submodule, and the combining submodule can perform point multiplication on the respective weights of the multiple channels and the respective weights of the multiple spatial positions, so that a global attention weight is output.
As shown in fig. 3, after the weight generation submodule outputs the channel weight tensor (N, C, 1, 1) and the spatial position weight tensor (N, 1, H, W), the two can be multiplied elementwise to obtain the global attention weight (N, C, H, W)'. The global attention weight thus contains the weights of the plurality of channels and the weights of the plurality of spatial positions.
Once the global attention weight is obtained, the importance of the feature map input to the global attention module across different channels and different spatial positions is known. The feature map generation sub-module then multiplies the feature map input to the global attention module elementwise with the global attention weight to obtain the global attention feature map. The global attention feature map is therefore the feature map output by the convolution module with its attention redistributed, refining more precise global information.
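The joint and feature map generation steps reduce to two broadcasted elementwise products, as the following NumPy sketch shows. The shapes follow the figures; the random weight tensors stand in for the learned submodule outputs.

```python
import numpy as np

N, C, H, W = 2, 4, 3, 3
rng = np.random.default_rng(0)

feature_map = rng.standard_normal((N, C, H, W))  # output of the conv module
channel_w = rng.random((N, C, 1, 1))             # channel weight tensor
spatial_w = rng.random((N, 1, H, W))             # spatial position weight tensor

# Joint submodule: the elementwise product broadcasts to the full
# global attention weight (N, C, H, W)'.
global_w = channel_w * spatial_w

# Feature map generation submodule: reweight the input feature map.
global_attention_map = feature_map * global_w

print(global_w.shape)              # (2, 4, 3, 3)
print(global_attention_map.shape)  # (2, 4, 3, 3)
```

Broadcasting makes the "point multiplication" cheap: no cutting or splicing of the feature map is needed, which is exactly the efficiency argument the embodiment makes.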
In a specific implementation, the weight generation submodule needs to generate weights of a plurality of channels and weights of a plurality of spatial positions, and the weight generation submodule may specifically include: a channel attention unit and a spatial attention unit.
The channel attention unit is used for processing the feature map input into the global attention module from a channel dimension to output weight values of a plurality of channels.
Specifically, the channel attention unit may pool adaptively over the spatial dimension: a conv1x1 convolution generates a corresponding weight for each pixel, and a weighted sum is then computed. Finally, the weights of the different channels are obtained through a full connection layer FC and a Sigmoid function layer.
The spatial attention unit is used for processing the feature map input into the global attention module from a spatial position dimension to output a weight value of a plurality of spatial positions.
Specifically, the spatial attention unit may pool adaptively over the channels: a conv1x1 convolution generates a different weight for each channel, the weighted sum is computed, and the weights of the different spatial positions are finally obtained through a full connection layer FC and a Sigmoid layer.
Referring to fig. 5, a schematic structural diagram of a global attention module in an embodiment is shown. As shown in fig. 5, the global attention module 400 may include a weight generation submodule, a joint submodule and a feature map generation submodule, and the weight generation submodule may include a channel attention unit 401 and a spatial attention unit 402.
The channel attention unit 401 and the spatial attention unit 402 will be described in detail with reference to fig. 5:
first, the channel attention unit includes: the device comprises a first adjusting subunit, a pooling subunit, a second adjusting subunit and a weight value generating subunit. The method comprises the following specific steps:
the first adjusting subunit is configured to process the feature map input to the global attention module to obtain a first tensor; the pooling subunit is configured to pool the first tensor to obtain a second tensor; the second adjusting subunit is configured to adjust the second tensor to obtain a third tensor; and the weight generation subunit is used to process the third tensor to generate the channel weight tensor.
The first tensor can be understood as the tensor corresponding to the input feature map on the channels, reflecting the different channel information. For example, if the input feature map is an (N, C, H, W) tensor, it may become a first tensor of shape (N, HW, C, 1) after the conversion; for the specific conversion process, refer to the related art.
Then, the first tensor is pooled (average pooling, max pooling or stochastic pooling may be used) to reduce the number of parameters in the spatial dimension, yielding a second tensor that reflects the channel information together with the pooled spatial information. For example, max pooling the first tensor (N, HW, C, 1) yields the second tensor (N, 1, C, 1).
Finally, the second tensor can be adjusted into a third tensor and input to the weight generation subunit, which generates the channel weight tensor; for example, the second tensor (N, 1, C, 1) is adjusted and input to the weight generation subunit to generate the channel weight tensor (N, C, 1, 1).
In this embodiment, the pooling sub-unit may be a convolution unit of a preset convolution size; the weight generation subunit includes: a full connection layer and a Sigmoid function layer connected in sequence. The preset convolution size may be the size conv1 × 1, or the size convH × W, where H is the height of the input feature map, and W is the width of the input feature map.
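A hedged NumPy sketch of the channel attention unit's data flow, following the tensor shapes above. The learnable conv1x1 pooling is replaced by plain max pooling over the spatial axis, and the FC weights are random; both are illustrative stand-ins, not the patent's trained layers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, fc_w, fc_b):
    """x: (N, C, H, W) feature map; fc_w: (C, C) and fc_b: (C,) are an assumed FC layer."""
    n, c, h, w = x.shape
    # First adjusting subunit: (N, C, H, W) -> first tensor (N, HW, C, 1).
    first = x.reshape(n, c, h * w).transpose(0, 2, 1)[..., None]
    # Pooling subunit (max pooling stands in for the learned conv1x1
    # weighted sum): (N, HW, C, 1) -> second tensor (N, 1, C, 1).
    second = first.max(axis=1, keepdims=True)
    # Second adjusting subunit: -> third tensor (N, C) for the FC layer.
    third = second.reshape(n, c)
    # Weight generation subunit: FC + Sigmoid -> channel weight tensor.
    return sigmoid(third @ fc_w + fc_b).reshape(n, c, 1, 1)

rng = np.random.default_rng(1)
x = rng.standard_normal((2, 4, 3, 3))
weights = channel_attention(x, rng.standard_normal((4, 4)), np.zeros(4))
print(weights.shape)  # (2, 4, 1, 1)
```

The Sigmoid keeps every channel weight in (0, 1), so the unit rescales channels rather than replacing them.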
Secondly, as shown in fig. 5, the spatial attention unit may specifically include: a pooling subunit, a third adjusting subunit and a weight generating subunit, specifically:
the pooling subunit is configured to process the feature map input to the global attention module to obtain a fourth tensor; the third adjusting subunit is configured to adjust the fourth tensor to generate an adjusted fifth tensor; and the weight generation subunit is used for processing the fifth tensor to generate a spatial position weight tensor.
In this embodiment, the pooling subunit may be configured to perform max pooling, stochastic pooling or average pooling on each channel of the feature map to obtain a fourth tensor. The fourth tensor may be the tensor corresponding to the spatial positions of the input feature map, reflecting the information of the different spatial positions. For example, if the input feature map is an (N, C, H, W) tensor, it may become a fourth tensor of shape (N, 1, H, W) after the conversion; for the specific conversion process, refer to the related art.
In this embodiment, the third adjusting subunit adjusts the fourth tensor by converting it into a fifth tensor; for example, the fourth tensor (N, 1, H, W) is reshaped into (N, HW). The HW dimension is used to determine the weight of the pixel at each spatial position.
In this embodiment, the fifth tensor can be input to the weight generation subunit, so that the weights of the spatial positions are obtained by the weight generation subunit; for example, inputting the fifth tensor (N, HW) into the weight generation subunit generates spatial position weights of shape (N, HW). In this way, the spatial position weight tensor can characterize the importance of each spatial position.
In this embodiment, the pooling subunit may be a convolution unit with a preset convolution size, and the weight generation subunit may include a fully connected layer and a Sigmoid function layer connected in sequence. The preset convolution size may be conv1 × 1 or convH × W, where H is the height of the input feature map and W is the width of the input feature map.
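The spatial attention path can be sketched the same way: pool across the channel dimension, reshape (N, 1, H, W) to (N, HW), then apply a fully connected layer and Sigmoid. As before, the FC parameters and tensor shapes are hypothetical:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(feat, fc_weight, fc_bias):
    """Sketch of the spatial attention unit: average-pool over channels to get
    (N, H, W) (i.e. (N, 1, H, W) squeezed), reshape to (N, HW), then apply a
    fully connected layer and Sigmoid to produce spatial position weights (N, HW)."""
    n, c, h, w = feat.shape
    pooled = feat.mean(axis=1)                       # pool over channels -> (N, H, W)
    flat = pooled.reshape(n, h * w)                  # reshape (N, 1, H, W) -> (N, HW)
    return sigmoid(flat @ fc_weight + fc_bias)       # spatial weights in (0, 1)

# Hypothetical example with random FC parameters.
rng = np.random.default_rng(1)
feat = rng.standard_normal((2, 8, 4, 4))             # (N, C, H, W), so HW = 16
w_fc = rng.standard_normal((16, 16)) * 0.1
b_fc = np.zeros(16)
sw = spatial_attention(feat, w_fc, b_fc)             # (N, HW) spatial weight tensor
```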
Referring to fig. 6, a schematic structural diagram of the global attention module in an example of the present invention is shown. As shown in fig. 6, the module in the left dashed box is the channel attention unit and the module in the right solid box is the spatial attention unit, where the feature input to the channel attention unit is (N, HW, C, 1) and the feature input to the spatial attention unit is (N, C, H, W). The channel attention unit and the spatial attention unit may each include a conv1 × 1 convolutional layer, a fully connected layer FC, and a Sigmoid function layer, where the conv1 × 1 convolutional layer is used to pool the input features.
As can be seen from fig. 6, the channel attention unit outputs a feature of shape (N, C, 1, 1) and the spatial attention unit outputs a feature of shape (N, 1, H, W). After (N, C, 1, 1) and (N, 1, H, W) are dot-multiplied, the result can be dot-multiplied with the feature map (N, C, H, W) input to the global attention module to obtain the fused feature map.
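The fusion just described — broadcasting the two weight tensors against each other and then weighting the input feature map — can be sketched as follows; the shapes are hypothetical:

```python
import numpy as np

def fuse(feat, ch_w, sp_w):
    """Sketch of the fusion step: channel weights (N, C) reshaped to (N, C, 1, 1)
    are broadcast-multiplied with spatial weights (N, HW) reshaped to (N, 1, H, W),
    giving a global attention weight (N, C, H, W); the input feature map is then
    weighted element-wise by it."""
    n, c, h, w = feat.shape
    attn = ch_w.reshape(n, c, 1, 1) * sp_w.reshape(n, 1, h, w)  # (N, C, H, W)
    return feat * attn                                           # fused feature map

# Hypothetical example.
rng = np.random.default_rng(2)
feat = rng.standard_normal((2, 8, 4, 4))   # feature map input to the module
ch_w = rng.random((2, 8))                  # channel weight tensor (N, C)
sp_w = rng.random((2, 16))                 # spatial weight tensor (N, HW)
fused = fuse(feat, ch_w, sp_w)             # (N, C, H, W)
```

Because both weight tensors broadcast over the full (N, C, H, W) grid, no cutting or splicing of the feature map is needed, which matches the computation-saving property the description attributes to this module.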
Referring to fig. 7, a schematic diagram of the network structure of a target convolutional neural network obtained by adding the global attention module shown in fig. 3 or fig. 5 to the Resnet18 shown in fig. 1, where shortcut denotes the direct connection branch.
The target convolutional neural network is obtained by adding the global attention module to the convolution module, and its hyper-parameter settings can be consistent with those of the original convolutional neural network. In other words, the target convolutional neural network inherits the hyper-parameters of the original neural network, so it can be used directly as an initial image recognition model to process images.
In one embodiment, after the target convolutional neural network is obtained, it may be further refined to improve its image processing performance. Correspondingly, the target convolutional neural network can be trained with a sample image set as training samples to obtain an image recognition model for image recognition.
In this embodiment, when the target convolutional neural network is trained, the set hyper-parameters may be consistent with the original convolutional neural network.
The sample image set can include a plurality of sample images aiming at the same image recognition task, and each sample image can carry a label or not according to actual training requirements.
The image recognition task can be a face image recognition task, an image classification task, an attribute recognition task, a fingerprint image recognition task, an iris image recognition task, and the like. Correspondingly, for the face image recognition task, the sample image set can comprise a plurality of face images from different faces or the same face; for the attribute recognition task, the sample image set can comprise a plurality of sample images with different attributes; for the fingerprint image recognition task, the sample image set may include multiple fingerprint images from different fingers or the same finger; for the iris image recognition task, the sample image set may include multiple iris images from different eyes or the same eye.
In this embodiment, for different image recognition tasks, the target convolutional neural network may be trained according to a corresponding correlation technique to obtain an image recognition model, where a structure of the obtained image recognition model is consistent with a structure of the target convolutional neural network.
In one specific implementation, when the target convolutional neural network is trained by using a sample image set as a training sample, the target convolutional neural network at the end of training may be determined as an image recognition model for performing image recognition.
In practice, when the accuracy of image recognition reaches a preset accuracy, the training is considered to be finished, and then the target convolutional neural network at the moment is determined as an image recognition model.
In another specific implementation, when a sample image set is used as training samples, the target convolutional neural network is first trained with the sample image set; during training, a plurality of candidate image recognition models subjected to different numbers of training iterations are obtained; finally, a model meeting a preset test condition is screened out from the plurality of candidate image recognition models to obtain the image recognition model for image recognition.
In this specific implementation, the image samples in the sample image set may be input to the target convolutional neural network in batches for training; for example, if the sample images are input in 100 batches, the target convolutional neural network is trained 100 times.
In practice, the target convolutional neural network at the end of each training may be saved; for example, 100 target convolutional neural networks are saved for 100 trainings. Alternatively, saving may begin only after a preset number of trainings; for example, if saving starts after the 50th training, 50 target convolutional neural networks may be saved. Alternatively, the target convolutional neural network at the end of every N trainings may be saved; for example, saving every 10 trainings yields 10 target convolutional neural networks.
The target convolutional neural network stored each time can be used as a candidate image recognition model, and then a plurality of candidate image recognition models are obtained.
After the plurality of candidate image recognition models are obtained, they may be tested with the test sample to obtain the test result output by each candidate image recognition model. From the test results, the accuracy of image recognition can be determined; the candidate image recognition model with the highest accuracy can then be screened out and determined as the image recognition model. However, the embodiment of the present invention is not limited to this; alternatively, the target convolutional neural network obtained after N iterations of training may be determined as the image recognition model, where N is a positive integer whose specific value may be set according to the actual application.
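The screening step described above can be sketched as follows; the checkpoint names, the `evaluate` callable, and the simulated accuracies are all hypothetical illustrations, not part of the patent:

```python
import random

def select_model(candidates, evaluate):
    """Sketch of the screening step: evaluate each saved candidate model on the
    test sample and keep the one with the highest accuracy. `evaluate` is a
    hypothetical callable mapping a candidate to its test accuracy."""
    return max(candidates, key=evaluate)

# Hypothetical example: candidates saved every 10 of 100 trainings,
# with simulated test accuracies in place of real evaluation.
random.seed(0)
accuracy = {f"ckpt_{i}": random.random() for i in range(10, 101, 10)}
chosen = select_model(list(accuracy), accuracy.get)
```

In a real pipeline `evaluate` would run the candidate over a held-out test set and compute recognition accuracy, and the "preset test condition" could also be a threshold rather than an argmax.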
With the above embodiment, after obtaining the image recognition model, the image recognition model may be used for image recognition, and specifically, when performing image recognition by using the image recognition model, the method may specifically include the following steps:
step S203: and obtaining an image to be identified.
According to the image recognition task, the image to be recognized can be a face image, a fingerprint image or an image shot aiming at a specific object.
Step S204: and performing feature extraction on the image to be recognized to obtain a feature map of the image to be recognized.
In this embodiment, feature extraction may be performed on the image to be recognized, specifically, feature encoding may be performed on the image to be recognized, so as to mathematically quantize information in the image to be recognized, thereby obtaining a feature map of the image to be recognized.
Step S205: and inputting the characteristic diagram of the image to be recognized into an image recognition model to obtain an image recognition result.
In this embodiment, the feature map of the image to be recognized may be input to the input end of the image recognition model, and the image recognition model may perform pooling processing, convolution processing, and the like on it. The global attention module of each convolution module in the image recognition model outputs a global attention feature map, so the image recognition model can dynamically learn the importance of different channels and different spatial positions, avoid frequent cutting and splicing operations on the feature map, reduce the amount of computation, and optimize the attention mechanism.
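Steps S203 to S205 can be sketched as a minimal pipeline; the feature encoder and the toy model below are hypothetical placeholders for the learned components of a real image recognition model:

```python
import numpy as np

def extract_features(image):
    """Step S204 sketch: encode the image into a feature map. A real system
    would use learned convolutions; here, a hypothetical encoding that just
    adds batch and channel axes to form an (N, C, H, W) tensor."""
    return image[np.newaxis, np.newaxis, :, :].astype(np.float64)

def recognize(feature_map, model):
    """Step S205 sketch: `model` is a hypothetical callable mapping the
    feature map to class scores; the result is the argmax class index."""
    scores = model(feature_map)
    return int(np.argmax(scores))

def toy_model(fm):
    # Hypothetical stand-in for the image recognition model:
    # mean-pool the feature map, then produce two fixed class scores.
    pooled = fm.mean(axis=(2, 3)).ravel()
    return np.array([pooled[0], -pooled[0]])

image = np.ones((4, 4))              # step S203: obtained image to be recognized
fm = extract_features(image)         # step S204: feature map (1, 1, 4, 4)
label = recognize(fm, toy_model)     # step S205: image recognition result
```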
It should be noted that after the target convolutional neural network of the embodiment of the application is obtained, its structure can be further improved to raise the efficiency and accuracy of image processing. For example, a convolution module of the target convolutional neural network may be replaced by a multi-scale sensing module, which is configured to output a fused feature map of feature maps of multiple scales according to the feature map input to it; specifically, the feature maps of different receptive fields may share part of the features by sharing a convolution kernel, so as to further improve the accuracy of feature extraction.
Of course, in practice, the convolution module of the target convolutional neural network may also be replaced by a receptive field adaptive module, and the sum of the output of each receptive field adaptive module and the output of the direct connection branch may be used as the input of the next receptive field adaptive module. The receptive field adaptive module is used to generate corresponding weights for multiple receptive fields so as to process the feature maps of the multiple receptive fields, which can avoid unreasonable artificial design of the weights of different receptive fields and improve the accuracy of feature extraction.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Based on the same inventive concept, referring to fig. 8, a schematic frame diagram of a convolutional neural network building apparatus according to an embodiment of the present invention is shown, and as shown in fig. 8, the convolutional neural network building apparatus may specifically include the following modules:
a determining module 801, configured to determine an output end of a convolution module from an original convolutional neural network, where the convolution module includes multiple convolution layers, and a direct branch is provided between an input end and an output end of the convolution module, and an input of the convolution module is shared with an input of the direct branch;
an adding module 802, configured to add a global attention module between an output of the convolution module and an output of the direct connection branch to obtain a target convolution neural network, where the global attention module is configured to output a global attention feature map, and a sum of an output of the global attention module and an output of the direct connection branch is an input of a next convolution module.
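The data flow produced by the determining module 801 and adding module 802 — the global attention module inserted between the convolution module's output and the summation with the direct connection branch — can be sketched abstractly. All three callables below are hypothetical placeholders for the real network components:

```python
def residual_with_attention(conv_module, attention_module, shortcut, x):
    """Sketch of the constructed block: the global attention module processes
    the convolution module's output, and the sum of the attention output and
    the direct connection branch is the input of the next convolution module."""
    return attention_module(conv_module(x)) + shortcut(x)

# Hypothetical scalar example: the convolution module doubles its input,
# the attention module applies a weight of 0.5, and the shortcut is identity.
out = residual_with_attention(lambda v: 2 * v, lambda v: 0.5 * v, lambda v: v, 3.0)
```

Contrast this with the original residual block, which computes `conv_module(x) + shortcut(x)`; the only structural change is the attention module on the convolution branch, which is why the hyper-parameters of the original network carry over unchanged.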
Optionally, the apparatus may further include a training module, configured to train the target convolutional neural network by using a sample image set as a training sample, so as to obtain an image recognition model for performing image recognition.
Optionally, the global attention module comprises: a weight value generation submodule, a combination submodule and a feature map generation submodule;
the weight generation submodule is used for processing the feature diagram input into the global attention module from the dimension of a space position and the dimension of a channel to generate weights of a plurality of channels and weights of a plurality of space positions;
the joint submodule is used for processing the weights of the channels and the weights of the spatial positions and outputting a global attention weight;
and the feature map generation submodule is used for processing the feature map input into the global attention module according to the global attention weight value to generate the global attention feature map.
Optionally, the weight value generation sub-module includes: a channel attention unit and a spatial attention unit;
the channel attention unit is used for processing the feature map input into the global attention module from a channel dimension so as to output weights of a plurality of channels;
the spatial attention unit is used for processing the feature map input into the global attention module from the spatial position dimension to output a plurality of spatial position weights.
Optionally, the channel attention unit comprises: the device comprises a first adjusting subunit, a pooling subunit, a second adjusting subunit and a weight value generating subunit;
the first adjusting subunit is configured to process the feature map input to the global attention module to obtain a first tensor;
the pooling subunit is configured to perform pooling processing on the first tensor to obtain a second tensor;
the second adjustment subunit is configured to adjust the second tensor to obtain a third tensor;
and the weight generation subunit is used for processing the third tensor to generate a channel weight tensor.
Optionally, the spatial attention unit comprises: the pooling subunit, the third adjusting subunit and the weight generating subunit;
the pooling subunit is configured to process the feature map input to the global attention module to obtain a fourth tensor;
the third adjusting subunit is configured to adjust the fourth tensor to generate an adjusted fifth tensor;
and the weight generation subunit is used for processing the fifth tensor to generate a spatial position weight tensor.
Optionally, the pooling sub-unit is a convolution unit with a preset convolution size; the weight generation subunit includes: a full connection layer and a Sigmoid function layer connected in sequence.
Optionally, the training module may specifically include the following units:
the training unit is used for training the target convolutional neural network by taking a sample image set as a training sample;
the storage unit is used for obtaining a plurality of candidate image recognition models which are trained for different times in the training process;
and the screening unit is used for screening a model meeting a preset test condition from the candidate image recognition models to obtain an image recognition model for image recognition.
Optionally, the apparatus may comprise the following modules:
the image obtaining module is used for obtaining an image to be identified;
the characteristic extraction module is used for extracting the characteristics of the image to be identified to obtain a characteristic diagram of the image to be identified;
and the image input module is used for inputting the characteristic diagram of the image to be recognized into an image recognition model to obtain an image recognition result.
Embodiments of the present invention further provide an electronic device, which may include a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the convolutional neural network construction method.
Embodiments of the present invention further provide a computer-readable storage medium storing a computer program for causing a processor to execute the convolutional neural network construction method according to the embodiments of the present invention.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The method, the apparatus, the device and the storage medium for constructing the convolutional neural network provided by the present invention are described in detail above, and a specific example is applied in the present disclosure to illustrate the principle and the implementation of the present invention, and the description of the above embodiment is only used to help understanding the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (12)

1. A convolutional neural network construction method, the method comprising:
determining an output end of a convolution module from an original convolution neural network, wherein the convolution module comprises a plurality of convolution layers, a direct connection branch is arranged between an input end and an output end of the convolution module, and an input of the convolution module is shared with an input of the direct connection branch;
and adding a global attention module between the output end of the convolution module and the output end of the direct connection branch to obtain a target convolution neural network, wherein the global attention module is used for outputting a global attention feature map, and the sum of the output of the global attention module and the output of the direct connection branch is the input of the next convolution module.
2. The method of claim 1, wherein the global attention module comprises: a weight value generation submodule, a combination submodule and a feature map generation submodule;
the weight generation submodule is used for processing the feature diagram input into the global attention module from the dimension of a space position and the dimension of a channel to generate weights of a plurality of channels and weights of a plurality of space positions;
the joint submodule is used for processing the weights of the channels and the weights of the spatial positions and outputting a global attention weight;
and the feature map generation submodule is used for processing the feature map input into the global attention module according to the global attention weight value to generate the global attention feature map.
3. The method of claim 2, wherein the weight generation submodule comprises: a channel attention unit and a spatial attention unit;
the channel attention unit is used for processing the feature map input into the global attention module from a channel dimension so as to output weights of a plurality of channels;
the spatial attention unit is used for processing the feature map input into the global attention module from the spatial position dimension to output a plurality of spatial position weights.
4. The method of claim 3, wherein the channel attention unit comprises: the device comprises a first adjusting subunit, a pooling subunit, a second adjusting subunit and a weight value generating subunit;
the first adjusting subunit is configured to process the feature map input to the global attention module to obtain a first tensor;
the pooling subunit is configured to perform pooling processing on the first tensor to obtain a second tensor;
the second adjustment subunit is configured to adjust the second tensor to obtain a third tensor;
and the weight generation subunit is used for processing the third tensor to generate a channel weight tensor.
5. The method of claim 3, wherein the spatial attention unit comprises: the pooling subunit, the third adjusting subunit and the weight generating subunit;
the pooling subunit is configured to process the feature map input to the global attention module to obtain a fourth tensor;
the third adjusting subunit is configured to adjust the fourth tensor to generate an adjusted fifth tensor;
and the weight generation subunit is used for processing the fifth tensor to generate a spatial position weight tensor.
6. The method according to claim 4 or 5, wherein the pooling sub-unit is a convolution unit of a preset convolution size; the weight generation subunit includes: a full connection layer and a Sigmoid function layer connected in sequence.
7. The method of any of claims 1-5, further comprising: and training the target convolutional neural network by taking the sample image set as a training sample to obtain an image recognition model for image recognition.
8. The method of claim 7, wherein training the target convolutional neural network with a sample image set as a training sample to obtain an image recognition model for image recognition, comprises:
training the target convolutional neural network by taking a sample image set as a training sample;
in the training process, obtaining a plurality of candidate image recognition models which are trained for different times;
and screening a model meeting a preset test condition from the candidate image recognition models to obtain an image recognition model for image recognition.
9. The method of claim 7, wherein after obtaining an image recognition model for image recognition, the method comprises:
obtaining an image to be identified;
extracting the features of the image to be recognized to obtain a feature map of the image to be recognized;
and inputting the characteristic diagram of the image to be recognized into an image recognition model to obtain an image recognition result.
10. An apparatus for convolutional neural network construction, the apparatus comprising:
a determining module, configured to determine an output end of a convolution module from an original convolutional neural network, wherein the convolution module comprises a plurality of convolution layers, a direct connection branch is arranged between the input end and the output end of the convolution module, and the input of the convolution module is shared with the input of the direct connection branch;
and the adding module is used for adding a global attention module between the output end of the convolution module and the output end of the direct connection branch to obtain a target convolution neural network, wherein the global attention module is used for outputting a global attention feature map, and the sum of the output of the global attention module and the output of the direct connection branch is the input of the next convolution module.
11. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor when executing implementing the convolutional neural network construction method of any one of claims 1-9.
12. A computer-readable storage medium storing a computer program for causing a processor to execute the convolutional neural network construction method as claimed in any one of claims 1 to 9.
CN202010414618.8A 2020-05-15 2020-05-15 Convolutional neural network construction method, device, equipment and medium Pending CN111783935A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010414618.8A CN111783935A (en) 2020-05-15 2020-05-15 Convolutional neural network construction method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010414618.8A CN111783935A (en) 2020-05-15 2020-05-15 Convolutional neural network construction method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN111783935A true CN111783935A (en) 2020-10-16

Family

ID=72754165

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010414618.8A Pending CN111783935A (en) 2020-05-15 2020-05-15 Convolutional neural network construction method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN111783935A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112598126A * 2020-12-04 2021-04-02 北京迈格威科技有限公司 Neural network construction method, device, equipment and medium
CN112346056A * 2021-01-11 2021-02-09 长沙理工大学 Resolution characteristic fusion extraction method and identification method of multi-pulse radar signals
CN112346056B * 2021-01-11 2021-03-26 长沙理工大学 Resolution characteristic fusion extraction method and identification method of multi-pulse radar signals
CN113239899A * 2021-06-17 2021-08-10 阿波罗智联(北京)科技有限公司 Method for processing image and generating convolution kernel, road side equipment and cloud control platform


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination