CN115311518A - Method, device, medium and electronic equipment for acquiring visual attribute information - Google Patents

Method, device, medium and electronic equipment for acquiring visual attribute information

Info

Publication number
CN115311518A
Authority
CN
China
Prior art keywords
network
pooling
feature
image
attribute information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210963972.5A
Other languages
Chinese (zh)
Inventor
高�浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ainnovation Nanjing Technology Co ltd
Original Assignee
Ainnovation Nanjing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ainnovation Nanjing Technology Co ltd filed Critical Ainnovation Nanjing Technology Co ltd
Priority to CN202210963972.5A priority Critical patent/CN115311518A/en
Publication of CN115311518A publication Critical patent/CN115311518A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • G06V10/763 Non-hierarchical techniques, e.g. based on statistics of modelling distributions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Abstract

The embodiments of the present application provide a method, a device, a medium and an electronic device for acquiring visual attribute information. The method includes: acquiring an image to be recognized; and inputting the image to be recognized into a target attribute information extraction model and obtaining at least one piece of visual attribute information through the target attribute information extraction model. The target attribute information extraction model includes a first feature extraction network, a second feature extraction network and an output network, where the first feature extraction network uses depthwise separable convolution to extract features from the image to be recognized to obtain three feature maps, the second feature extraction network extracts features again from each of the three feature maps to obtain three target feature maps, and the output network outputs the three target feature maps. With the technical solution of the embodiments of the present application, higher accuracy is achieved while the detection speed of the model is maintained.

Description

Method, device, medium and electronic equipment for acquiring visual attribute information
Technical Field
The present application relates to the field of deep learning, and in particular, to a method, an apparatus, a medium, and an electronic device for obtaining visual attribute information.
Background
In today's era of rapid informatization, information exists in many forms, mainly including voice, text, images and video. A person carries a great deal of information, such as expression, gender and race. Recognition of person attributes (a type of visual attribute information) plays an important role in real life: in a factory, recognizing worker attributes can provide more safety guarantees for workers, and in a shopping mall, person attribute recognition can provide more precise services to customers. Research on person attribute recognition is therefore of great significance.
The development of deep learning has driven continuous breakthroughs in target detection, recognition, segmentation and other technologies across many fields. Traditional algorithms require complex preprocessing of the data and hand-crafted features designed separately for each task, which is time-consuming and generalizes poorly. Deep learning instead extracts features from the data end to end, giving stronger generalization ability and higher robustness.
Traditional target detection methods mainly rely on operators such as histograms and Haar-like features, so the overall pipeline is cumbersome and the accuracy is low. Two-stage algorithms such as R-CNN and Fast R-CNN achieve high accuracy, but because the feature extraction stage must first generate candidate regions and then process each of them, their real-time performance is poor.
Disclosure of Invention
An object of the embodiments of the present application is to provide a method, an apparatus, a medium and an electronic device for obtaining visual attribute information, whose technical solution achieves higher accuracy while maintaining the detection speed of the model.
In a first aspect, an embodiment of the present application provides a method for acquiring visual attribute information, the method including: acquiring an image to be recognized; and inputting the image to be recognized into a target attribute information extraction model and obtaining at least one piece of visual attribute information through the target attribute information extraction model, where the target attribute information extraction model includes a first feature extraction network, a second feature extraction network and an output network, the first feature extraction network uses depthwise separable convolution to extract features from the image to be recognized to obtain three feature maps, the second feature extraction network extracts features again from each of the three feature maps to obtain three target feature maps, and the output network outputs the three target feature maps.
In some embodiments of the present application, the first feature extraction network replaces ordinary convolution with depthwise separable convolution, which reduces the number of parameters in the convolution computation and makes the model lightweight.
In some embodiments, the first feature extraction network comprises a MobileNetV3 network, where the MobileNetV3 network performs feature extraction on the image to be recognized using the depthwise separable convolution.
Some embodiments of the present application implement the depthwise separable convolution operation with a MobileNetV3 network.
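As an illustration of why this reduces parameters, the sketch below shows a depthwise separable convolution block in PyTorch: a per-channel (depthwise) 3 × 3 convolution followed by a 1 × 1 pointwise convolution. The channel sizes, the BN/h-swish pairing and the class name are illustrative assumptions, not the patent's exact MobileNetV3 configuration. For 32 → 64 channels with a 3 × 3 kernel, an ordinary convolution needs 32 × 64 × 9 = 18,432 weights, while the separable version needs only 32 × 9 + 32 × 64 = 2,336.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Minimal sketch: depthwise conv (one filter per channel) + 1x1 pointwise conv."""
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3, stride: int = 1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, stride,
                                   padding=kernel_size // 2, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.Hardswish()  # h-swish activation, as used in MobileNetV3

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

if __name__ == "__main__":
    x = torch.randn(1, 32, 56, 56)
    block = DepthwiseSeparableConv(32, 64)
    print(block(x).shape)  # torch.Size([1, 64, 56, 56])
```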
In some embodiments, the first feature extraction network further comprises a first pooling processing module configured to derive fixed-size feature vectors and a second pooling processing module configured to extract different-size spatial feature information.
Some embodiments of the present application use dual SPP modules (i.e., a first pooling processing module and a second pooling processing module) in the Backbone feature extraction part (i.e., in the first feature extraction network). The first SPP mainly outputs a feature vector of a fixed size, preventing the image distortion that direct cropping and scaling would cause when the image is too large; the second SPP mainly extracts spatial feature information of different sizes, improving the robustness of the model to spatial layout and object deformation. The SPP module operation enriches the expressive power of the feature map, is particularly helpful when the targets in the detected image differ greatly in size, and can further improve the precision of the detection model.
In some embodiments, obtaining at least one piece of visual attribute information through the target attribute information extraction model includes: inputting the image to be recognized into the first pooling processing module to obtain a fixed-size feature vector; obtaining an initial feature map from the feature vector and the MobileNetV3 network; and obtaining a target feature map from the initial feature map and the second pooling processing module, where the target feature map is one of the three feature maps.
In some embodiments of the application, the first pooling processing module obtains processing data with a fixed size, and then the second pooling processing module obtains the target feature map, so that the universality of the technical scheme is improved.
In some embodiments, the first and second pooling processing modules each comprise a first pooling sub-module, a second pooling sub-module, a third pooling sub-module and a splicing sub-module. The first pooling sub-module pools the input image with a first-size pooling kernel to obtain a first-size pooled map, the second pooling sub-module pools the input image with a second-size pooling kernel to obtain a second-size pooled map, and the third pooling sub-module pools the input image with a third-size pooling kernel to obtain a third-size pooled map. The splicing sub-module is configured to fuse the input image, the first-size pooled map, the second-size pooled map and the third-size pooled map on one channel using a splicing (concat) function, producing the output image of the first or second pooling processing module.
The detection precision and the detection speed of the model can be further improved through the framework of the pooling processing module provided by the embodiment of the application.
In some embodiments, the first-size pooling kernel is a 5 × 5 kernel, the second-size pooling kernel is a 9 × 9 kernel, and the third-size pooling kernel is a 13 × 13 kernel.
Some embodiments of the present application use pooling kernels of three sizes to improve the versatility of the solution.
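A minimal PyTorch sketch of such a pooling processing module is shown below: three parallel max-pooling branches with 5 × 5, 9 × 9 and 13 × 13 kernels whose outputs are concatenated with the input along the channel dimension. The use of stride 1 with matching padding (so the spatial size is preserved) is an assumption the patent does not state.

```python
import torch
import torch.nn as nn

class SPPBlock(nn.Module):
    """Sketch of the pooling processing module: parallel 5/9/13 max pooling + channel concat."""
    def __init__(self, kernel_sizes=(5, 9, 13)):
        super().__init__()
        # stride 1 and padding k//2 keep the spatial size unchanged
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in kernel_sizes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # fuse the input and the three pooled maps on the channel dimension (concat)
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)

if __name__ == "__main__":
    x = torch.randn(1, 256, 19, 19)
    print(SPPBlock()(x).shape)  # torch.Size([1, 1024, 19, 19]) -- channels multiplied by 4
```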
In some embodiments, the first feature extraction network further comprises a CBAM attention mechanism processing module, wherein the CBAM attention mechanism processing module receives image data output by the MobileNetV3 network, and output data of the CBAM attention mechanism processing module serves as input data to the second feature extraction network.
The CBAM attention mechanism used in some embodiments of the application adds a spatial attention module on top of the SENet attention mechanism, improving the accuracy of the obtained prediction result.
In some embodiments, the CBAM attention mechanism processing module is configured to: for the input feature map, take the average value and the maximum value over the channels at each feature point to obtain two results, stack the two results, change the number of channels through a pointwise convolution, obtain the weight of each feature point of the input feature layer through a Sigmoid function, and multiply the weight with the input feature map to complete the spatial attention operation.
Some embodiments of the present application accomplish the spatial attention operations by a CBAM attention mechanism processing module.
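The spatial attention step described above can be sketched as follows. Following the text, a pointwise convolution maps the two stacked maps to a single channel; the original CBAM paper uses a 7 × 7 convolution here, so the kernel size should be treated as an assumption.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of the spatial-attention step of CBAM as described in the text."""
    def __init__(self):
        super().__init__()
        # the text describes a pointwise convolution mapping the 2 stacked maps to 1 channel
        # (the original CBAM paper uses a 7x7 convolution at this point instead)
        self.conv = nn.Conv2d(2, 1, kernel_size=1, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg_map = torch.mean(x, dim=1, keepdim=True)      # per-pixel mean over the channels
        max_map, _ = torch.max(x, dim=1, keepdim=True)    # per-pixel max over the channels
        weight = self.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * weight                                 # reweight every spatial location

if __name__ == "__main__":
    x = torch.randn(1, 128, 38, 38)
    print(SpatialAttention()(x).shape)  # torch.Size([1, 128, 38, 38])
```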
In some embodiments, the MobileNetV3 network comprises a first MobileNetV3 network, a second MobileNetV3 network, and a third MobileNetV3 network, the CBAM attention mechanism processing module comprises a first CBAM attention mechanism processing module, a second CBAM attention mechanism processing module, and a third CBAM attention mechanism processing module, wherein an output of the first MobileNetV3 network serves as an input to the first CBAM attention mechanism processing module; the output of the second MobileNetV3 network is used as the input of the second CBAM attention mechanism processing module; the output of the third MobileNetV3 network serves as the input of the second pooling processing module, and the output of the second pooling processing module serves as the input of the third CBAM attention mechanism processing module.
Some embodiments of the present application provide a specific architecture diagram.
In some embodiments, the second feature extraction network uses a PANet network.
In some embodiments, before inputting the image to be recognized into the target attribute information extraction model, the method further comprises: clustering the targets in the data set using a K-means algorithm to obtain prior boxes of a plurality of target sizes; and training an attribute information extraction model according to the prior boxes to obtain the target attribute information extraction model.
Some embodiments of the application use the kmeans++ algorithm to cluster the target boxes, which avoids excessively slow convergence of the algorithm during training.
In some embodiments, the K-means algorithm is the kmeans++ algorithm.
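A minimal sketch of the anchor clustering step is given below, using scikit-learn's KMeans with k-means++ initialization on the (width, height) pairs of the labelled boxes. It clusters with Euclidean distance; YOLO-family pipelines often use a 1 - IoU distance instead, and the patent does not say which is used, so this is only an illustrative assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_anchors(box_wh: np.ndarray, n_anchors: int = 9) -> np.ndarray:
    """Cluster (width, height) pairs of labelled boxes into n_anchors prior boxes.

    box_wh: array of shape (N, 2) with the width and height of every ground-truth box,
    ideally already scaled to the network input size.
    """
    km = KMeans(n_clusters=n_anchors, init="k-means++", n_init=10, random_state=0)
    km.fit(box_wh)
    anchors = km.cluster_centers_
    # sort by area so small/medium/large anchors map naturally onto the three output heads
    return anchors[np.argsort(anchors[:, 0] * anchors[:, 1])]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_boxes = rng.uniform(10, 300, size=(500, 2))  # placeholder box sizes
    print(cluster_anchors(fake_boxes))
```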
In a second aspect, some embodiments of the present application provide an apparatus for obtaining visual attribute information, the apparatus comprising: an acquisition module configured to acquire an image to be recognized; and a visual attribute information obtaining module configured to input the image to be recognized into a target attribute information extraction model and obtain at least one piece of visual attribute information through the target attribute information extraction model, where the target attribute information extraction model includes a first feature extraction network, a second feature extraction network and an output network, the first feature extraction network uses depthwise separable convolution to extract features from the image to be recognized to obtain three feature maps, the second feature extraction network extracts features again from each of the three feature maps to obtain three target feature maps, and the output network outputs the three target feature maps.
In a third aspect, some embodiments of the present application provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, may implement the method as described in any of the embodiments of the first aspect above.
In a fourth aspect, some embodiments of the present application provide an electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, may implement the method according to any of the embodiments of the first aspect.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings used in the embodiments are briefly described below. It should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope; those skilled in the art can also obtain other related drawings from these drawings without inventive effort.
Fig. 1 is a flowchart of a method for training an attribute information model to obtain a target attribute information extraction model according to an embodiment of the present application;
fig. 2 is an architecture diagram of a pooling processing module provided by an embodiment of the present application;
fig. 3 is an architecture diagram of a target attribute information extraction model or an attribute information model provided in an embodiment of the present application;
fig. 4 is a flowchart of a method for acquiring visual attribute information according to an embodiment of the present application;
FIG. 5 is a block diagram illustrating an apparatus for obtaining visual attribute information according to an embodiment of the present disclosure;
fig. 6 is a schematic composition diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
Traditional target detection algorithms generally extract features with operators such as the Histogram of Oriented Gradients (HOG), Haar-like features (Haar) and feature point extraction (ORB), and they generalize poorly and have low accuracy. Two-stage deep learning target detection algorithms are accurate, but the network must first generate candidate regions and then classify them and perform position regression to obtain the detection result, so the real-time performance is poor and cannot meet the requirements of actual scenes. Some embodiments of the application therefore adopt a one-stage model, meeting the real-time requirements of real scenes while addressing both the accuracy and the speed of the algorithm.
To meet the real-time requirement and strike a trade-off between speed and precision, some embodiments of the application select the one-stage Yolov4 algorithm for visual attribute information (e.g., person attribute) recognition and improve it to raise both the precision and the detection speed of the model. For example, some embodiments make corresponding improvements to both the Backbone network (corresponding to the first feature extraction network) and the Neck (corresponding to the second feature extraction network), ensuring a certain accuracy while maintaining the detection speed of the model.
It should be noted that, in the following embodiments of the present application, the target attribute information extraction model is obtained by training an attribute information extraction model, and the two models have the same architecture. The following exemplarily describes the training process and the architecture of the model in conjunction with fig. 1 to 3, and it should be understood that the architecture of the trained target attribute information extraction model may also adopt the architecture of fig. 2 to 3.
Referring to fig. 1, fig. 1 is a process of training an attribute information extraction model to obtain a target attribute information extraction model. It should be noted that fig. 1 illustrates the training process by taking the extraction of the person attribute information as an example, but some embodiments of the present application may also be used to extract other attribute information than the person attribute information.
As shown in fig. 1, the training process includes the following steps:
s1, firstly, a camera is used for acquiring worker data in a real scene, and certain image resolution is guaranteed during data acquisition.
S2, video data of the traffic intersection are obtained, the video stream is split into frames to obtain image data, and labeling software is used to annotate the actual images with the minimum bounding rectangles of the relevant worker attributes, for example glasses and masks.
The labeling software is used, for example, to annotate person attributes on an image. Attributes such as a person's mask, glasses, face, gloves and so on are framed with rectangular boxes, and an annotation file corresponding to the image is then generated automatically; the file contains the size information of the annotated boxes and the annotated attribute category information.
S3, the data are preprocessed, i.e., over-blurred images are removed, and the data set is divided according to a certain split ratio to obtain the training set, validation set and test set used for training.
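A minimal sketch of this preprocessing step is given below. The blur criterion (variance of the Laplacian below a threshold) and the 8:1:1 split ratio are assumptions for illustration; the patent does not specify either.

```python
import random
from pathlib import Path
import cv2

def is_too_blurred(image_path: Path, threshold: float = 100.0) -> bool:
    """Assumed blur test: variance of the Laplacian below a threshold means 'over-blurred'."""
    gray = cv2.imread(str(image_path), cv2.IMREAD_GRAYSCALE)
    return gray is None or cv2.Laplacian(gray, cv2.CV_64F).var() < threshold

def split_dataset(image_dir: str, ratios=(0.8, 0.1, 0.1)):
    """Drop over-blurred images, then split the remainder into train/val/test."""
    images = [p for p in Path(image_dir).glob("*.jpg") if not is_too_blurred(p)]
    random.shuffle(images)
    n_train = int(len(images) * ratios[0])
    n_val = int(len(images) * ratios[1])
    return images[:n_train], images[n_train:n_train + n_val], images[n_train + n_val:]
```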
S4, the processed data are fed into the attribute information extraction model to train the model, yielding the target attribute information extraction model.
S5, the trained model (as an example of a target attribute information extraction model) is used to recognize person attributes in the actual scene; that is, person attribute information can be extracted using the obtained target attribute information extraction model.
The architecture of the attribute information extraction model or the target attribute information extraction model is exemplarily set forth below with reference to fig. 2 and 3.
In some embodiments of the present application, the target attribute information extraction model or the attribute information extraction model includes a first feature extraction network, a second feature extraction network and an output network, where the first feature extraction network uses depthwise separable convolution to extract features from the image to be recognized to obtain three feature maps, the second feature extraction network extracts features again from each of the three feature maps to obtain three target feature maps, and the output network outputs the three target feature maps.
For example, in some embodiments of the present application, the first feature extraction network comprises a MobileNetV3 network, where the MobileNetV3 network performs feature extraction on the image to be recognized using the depthwise separable convolution. Some embodiments of the present application implement the depthwise separable convolution operation with a MobileNetV3 network.
For example, in some embodiments of the present application, the first feature extraction network further comprises a first pooling processing module configured to obtain feature vectors of a fixed size and a second pooling processing module configured to extract spatial feature information of different sizes. Some embodiments use dual SPP modules (i.e., a first pooling processing module and a second pooling processing module) in the Backbone feature extraction part (i.e., in the first feature extraction network): the first SPP mainly outputs a feature vector of a fixed size, preventing the image distortion that direct cropping and scaling would cause when the image is too large, while the second SPP mainly extracts spatial feature information of different sizes, improving the robustness of the model to spatial layout and object deformation. The SPP module operation enriches the expressive power of the feature map, is particularly helpful when the targets in the detected image differ greatly in size, and can further improve the precision of the detection model.
For example, in some embodiments of the present application, the first feature extraction network further comprises a CBAM attention mechanism processing module, where the CBAM attention mechanism processing module receives the image data output by the MobileNetV3 network and its output data serves as the input data of the second feature extraction network. The CBAM attention mechanism used in some embodiments adds a spatial attention module on top of the SENet attention mechanism, improving the accuracy of the obtained prediction result.
As shown in fig. 3, in some embodiments of the present application, the MobileNetV3 network comprises a first MobileNetV3 network (corresponding to the first M_bottleneck ×6 of fig. 3, where ×6 denotes six repetitions), a second MobileNetV3 network (corresponding to the second M_bottleneck ×6 of fig. 3) and a third MobileNetV3 network (corresponding to the M_bottleneck ×3 of fig. 3), and the CBAM attention mechanism processing module comprises a first CBAM attention mechanism processing module (corresponding to CBAM-1 of fig. 3), a second CBAM attention mechanism processing module (corresponding to CBAM-2 of fig. 3) and a third CBAM attention mechanism processing module (corresponding to CBAM-3 of fig. 3). The output of the first MobileNetV3 network serves as the input of the first CBAM attention mechanism processing module; the output of the second MobileNetV3 network serves as the input of the second CBAM attention mechanism processing module; the output of the third MobileNetV3 network serves as the input of the second pooling processing module, and the output of the second pooling processing module serves as the input of the third CBAM attention mechanism processing module. Note that in fig. 3, ×6 and ×3 indicate that feature extraction is performed 6 or 3 times in succession by the M_bottleneck module.
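Reading fig. 3 as described above, the Backbone wiring can be sketched as follows. The stage modules are injected placeholders (the real M_bottleneck stages are MobileNetV3 bottleneck stacks whose channel and stride settings the patent does not spell out), so the sketch only shows how the two SPP modules and the three CBAM modules are chained and which three feature maps are handed to the Neck.

```python
import torch.nn as nn

class Backbone(nn.Module):
    """Wiring sketch of the first feature extraction network (fig. 3); stage internals are placeholders."""
    def __init__(self, spp1, conv_bn, stage1, cbam1, stage2, cbam2, stage3, spp2, cbam3):
        super().__init__()
        self.spp1, self.conv_bn = spp1, conv_bn   # first SPP + Conv+BN stem
        self.stage1, self.cbam1 = stage1, cbam1   # M_bottleneck x6 -> CBAM-1
        self.stage2, self.cbam2 = stage2, cbam2   # M_bottleneck x6 -> CBAM-2
        self.stage3 = stage3                      # M_bottleneck x3
        self.spp2, self.cbam3 = spp2, cbam3       # second SPP -> CBAM-3

    def forward(self, x):
        x = self.conv_bn(self.spp1(x))
        p1 = self.cbam1(self.stage1(x))               # first feature map, also fed onward
        p2 = self.cbam2(self.stage2(p1))              # second feature map, also fed onward
        p3 = self.cbam3(self.spp2(self.stage3(p2)))   # third feature map
        return p1, p2, p3                             # the three feature maps passed to the Neck
```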
For example, as shown in fig. 3, in some embodiments of the present application, the first pooling processing module (corresponding to SPP-1 of fig. 3) and the second pooling processing module (corresponding to SPP-2 of fig. 3) each comprise a first pooling sub-module, a second pooling sub-module, a third pooling sub-module and a splicing sub-module. The first pooling sub-module pools the input image with a first-size pooling kernel to obtain a first-size pooled map, the second pooling sub-module pools the input image with a second-size pooling kernel to obtain a second-size pooled map, and the third pooling sub-module pools the input image with a third-size pooling kernel to obtain a third-size pooled map; the splicing sub-module is configured to fuse the input image, the first-size pooled map, the second-size pooled map and the third-size pooled map on one channel using a splicing (concat) function to obtain the output image of the first or second pooling processing module. For example, as shown in fig. 2, in some embodiments the first-size pooling kernel is a 5 × 5 kernel, the second-size pooling kernel is a 9 × 9 kernel, and the third-size pooling kernel is a 13 × 13 kernel. Some embodiments of the present application use pooling kernels of three sizes to improve the versatility of the solution.
The detection precision and the detection speed of the model can be further improved through the framework of the pooling processing module provided by the embodiment of the application.
For example, in some embodiments of the present application, the second feature extraction network uses a PANet network (corresponding to Neck of fig. 3).
It should be noted that, in some embodiments of the present application, a K-means algorithm is used to cluster the targets in the data set to obtain prior boxes of a plurality of target sizes, and an attribute information extraction model is trained according to the prior boxes to obtain the target attribute information extraction model. It is understood that some embodiments use the kmeans++ algorithm to cluster the target boxes, avoiding excessively slow convergence of the algorithm during training. For example, the K-means algorithm is the kmeans++ algorithm.
That is to say, the Backbone part of the attribute information extraction model is as shown in fig. 3. In some embodiments of the present application, in order to increase the detection speed, in a first aspect a MobileNetV3 network is used as the feature extraction module in the Yolov4 network. As a representative of lightweight networks, MobileNet widely replaces ordinary convolution with depthwise separable convolution in its design, reducing the number of parameters in the convolution computation and achieving a lightweight model. MobileNetV3 combines the advantages of MobileNetV1 and MobileNetV2 and mainly incorporates four characteristics: the depthwise separable convolution of MobileNetV1, the inverted residual structure with a linear bottleneck of MobileNetV2, a lightweight attention mechanism, and the replacement of the swish activation function with the h-swish function.

In a second aspect, in some embodiments the Backbone (corresponding to the first feature extraction network) combines MobileNetV3, SPP pooling and the CBAM attention mechanism for feature extraction. The SPP module used is shown in fig. 2; based on spatial pyramid pooling, it can effectively output a feature vector of a fixed size. For the input feature maps, the SPP module applies max pooling with 5 × 5, 9 × 9 and 13 × 13 pooling kernels, and the pooled results at these different scales are fused on one channel using the concat function. Note that some embodiments use dual SPP modules in the Backbone feature extraction part: the first SPP (corresponding to SPP-1 of fig. 3) mainly outputs a feature vector of a fixed size, preventing the image distortion that direct cropping and scaling would cause when the image is too large, while the second SPP (corresponding to SPP-2 of fig. 3) mainly extracts spatial feature information of different sizes, improving the robustness of the model to spatial layout and object deformation. The SPP module operation enriches the expressive power of the feature map, is of great help when targets in the detected image differ greatly in size, and can further improve the precision of the detection model.

In a third aspect, in some embodiments the CBAM attention mechanism adds a spatial attention module on top of the SENet attention mechanism. The channel attention of CBAM proceeds as follows: global average pooling and global max pooling are first applied to the input feature map, the two pooling results are processed by shared Dense layers, the processed results are added and passed through a Sigmoid function to obtain the weight of each channel, and the weight is multiplied with the original input feature map to complete the channel attention step.
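The channel attention step just described (global average and max pooling, shared Dense layers, addition, Sigmoid, channel-wise multiplication) can be sketched as below; the reduction ratio of the shared layers is an assumption.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of the CBAM channel-attention step described above (reduction ratio assumed)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # shared dense (MLP) layers applied to both pooling results
        self.shared_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        avg = self.shared_mlp(torch.mean(x, dim=(2, 3)))   # global average pooling branch
        mx = self.shared_mlp(torch.amax(x, dim=(2, 3)))    # global max pooling branch
        weight = self.sigmoid(avg + mx).view(b, c, 1, 1)   # per-channel weights
        return x * weight                                  # reweight the channels
```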
The spatial attention of CBAM proceeds as follows: for the input feature map, the average value and the maximum value are taken over the channels at each feature point, the two results are stacked, the number of channels is changed through a pointwise convolution, a Sigmoid function then yields the weight of each feature point of the input feature layer, and the weight is multiplied with the input feature map to complete the spatial attention operation. After the Backbone feature extraction stage, three effective feature maps are output, and the result is sent to the Neck (corresponding to the second feature extraction network) stage for enhanced feature extraction.
As shown in fig. 3, the backbone feature extraction network (Backbone) processes the data as follows:
inputting images with different sizes in training data, performing first SPP space pyramid pooling, performing maximum pooling operation on the images with different sizes (5 × 5, 9 × 9 and 13 × 13) and performing a concat function, and fusing output results with different dimensions by connecting the number of channels. Through the operation, the problem that the sizes of input images are not uniform can be solved, the feature vectors with fixed sizes are effectively output, and the next operation is convenient to carry out.
The result of the first SPP is taken as input: an ordinary convolution plus batch normalization (Conv + BN) operation is performed first, followed by six consecutive feature extraction passes through the MobileNetV3 network module, and the resulting output is sent to a CBAM attention mechanism module to assign weights to the channel and spatial information. That output is then, on the one hand, sent to the Neck network for feature fusion and extraction and, on the other hand, passed to the next feature extraction stage.
In the backbone, the output of the first CBAM attention module is sent to a MobileNetV3 feature extraction module for another six passes, and the output obtained from the second CBAM module is then sent both to the Neck and to the next part of the backbone.
The output of the second CBAM module undergoes three passes of feature extraction through the MobileNetV3 module and is then max-pooled at different sizes (5 × 5, 9 × 9, 13 × 13) by the second SPP module, further enriching the spatial feature information and improving the robustness of the model. The output of the second SPP goes through the third CBAM operation, and the final output is sent to the Neck for feature fusion and extraction.
To further reduce the number of parameters, some embodiments of the present application use depthwise separable convolutions instead of ordinary convolutions in the Neck of Yolov4. This module uses a PANet network, adopting top-down and bottom-up paths to enrich the semantic and positional information of the upper and lower layers and further improve the detection effect of the model.
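The sketch below illustrates, in heavily simplified form, this top-down plus bottom-up fusion with depthwise separable convolutions standing in for ordinary convolutions. The channel counts, the single-convolution fusion blocks and the use of nearest-neighbour resizing are all simplifying assumptions and do not reproduce the full PANet of Yolov4.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def dw_sep(in_ch: int, out_ch: int) -> nn.Sequential:
    """Depthwise separable conv: depthwise 3x3 + pointwise 1x1 (see the earlier sketch)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class TinyPANet(nn.Module):
    """Heavily simplified PANet-style Neck: top-down then bottom-up fusion."""
    def __init__(self, channels=(128, 256, 512)):
        super().__init__()
        c1, c2, c3 = channels
        self.lat2 = dw_sep(c3 + c2, c2)    # top-down: fuse p3 into p2
        self.lat1 = dw_sep(c2 + c1, c1)    # top-down: fuse p2 into p1
        self.down2 = dw_sep(c1 + c2, c2)   # bottom-up: fuse n1 into the fused p2
        self.down3 = dw_sep(c2 + c3, c3)   # bottom-up: fuse n2 into p3

    def forward(self, p1, p2, p3):
        # top-down path: upsample deeper maps and concatenate with shallower ones
        t2 = self.lat2(torch.cat([F.interpolate(p3, size=p2.shape[-2:]), p2], dim=1))
        n1 = self.lat1(torch.cat([F.interpolate(t2, size=p1.shape[-2:]), p1], dim=1))
        # bottom-up path: downsample shallower maps and concatenate with deeper ones
        n2 = self.down2(torch.cat([F.adaptive_max_pool2d(n1, t2.shape[-2:]), t2], dim=1))
        n3 = self.down3(torch.cat([F.adaptive_max_pool2d(n2, p3.shape[-2:]), p3], dim=1))
        return n1, n2, n3   # three target feature maps handed to the output network
```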
It should be noted that, for better model training, some embodiments of the present application use the kmeans++ algorithm to cluster the target anchors on the person attribute data set. In the field of target detection, a reasonable anchor setting has a great influence on model performance; if the setting is unreasonable or the sizes differ greatly from the actual targets, the model may miss detections or produce false detections. For example, in some embodiments, before training begins, the targets in the data set are clustered with the kmeans++ algorithm to obtain the required prior boxes of 9 sizes. Clustering the target boxes with kmeans++ avoids excessively slow convergence of the algorithm during training.
It should be understood that the network architecture above is also the network architecture of the target attribute information extraction model; to avoid repetition, it is not described again in detail when the specific application of the target attribute information extraction model is presented below.
As shown in fig. 4, an embodiment of the present application provides a method for acquiring visual attribute information, the method including: S101, acquiring an image to be recognized; and S102, inputting the image to be recognized into a target attribute information extraction model and obtaining at least one piece of visual attribute information through the target attribute information extraction model, where the target attribute information extraction model includes a first feature extraction network, a second feature extraction network and an output network, the first feature extraction network uses depthwise separable convolution to extract features from the image to be recognized to obtain three feature maps, the second feature extraction network extracts features again from each of the three feature maps to obtain three target feature maps, and the output network outputs the three target feature maps.
The first feature extraction network of some embodiments of the present application replaces ordinary convolution with depthwise separable convolution, which reduces the number of parameters in the convolution computation and makes the model lightweight.
For example, in some embodiments of the present application, the process of obtaining at least one piece of visual attribute information through the target attribute information extraction model in S102 illustratively includes: inputting the image to be recognized into the first pooling processing module to obtain a fixed-size feature vector; obtaining an initial feature map from the feature vector and the MobileNetV3 network; and inputting the initial feature map into the second pooling processing module to obtain a target feature map, where the target feature map is one of the three feature maps.
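An illustrative driver for S101/S102 might look like the following; the 416 × 416 input size, the normalisation to [0, 1] and the form of the model's outputs are assumptions, since the patent does not fix the preprocessing.

```python
import cv2
import torch

def run_inference(model: torch.nn.Module, image_path: str, input_size: int = 416):
    """Illustrative driver for S101/S102; preprocessing details are assumptions."""
    bgr = cv2.imread(image_path)                     # S101: acquire the image to be recognized
    rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
    resized = cv2.resize(rgb, (input_size, input_size))
    tensor = torch.from_numpy(resized).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    with torch.no_grad():                            # S102: run the target attribute extraction model
        outputs = model(tensor)                      # e.g., the three target feature maps / attributes
    return outputs
```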
In some embodiments of the application, the first pooling processing module obtains processing data with a fixed size, and then the second pooling processing module obtains the target feature map, so that the universality of the technical scheme is improved.
For example, in some embodiments of the present application, the CBAM attention mechanism processing module is configured to: for the input feature map, take the average value and the maximum value over the channels at each feature point to obtain two results, stack the two results, change the number of channels through a pointwise convolution, obtain the weight of each feature point of the input feature layer through a Sigmoid function, and multiply the weight with the input feature map to complete the spatial attention operation.
Some embodiments of the application accomplish the spatial attention operations by a CBAM attention mechanism processing module.
Referring to fig. 4, fig. 4 shows a device for acquiring visual attribute information according to an embodiment of the present application. It should be understood that the device corresponds to the method embodiment of fig. 1 and can perform the steps of that method embodiment; for the specific functions of the device, reference may be made to the description above, and detailed descriptions are omitted here where appropriate to avoid repetition. The device comprises at least one software functional module that can be stored in a memory in the form of software or firmware or solidified in the operating system of the device, and the device for acquiring visual attribute information comprises: an acquisition module 101 and a visual attribute information acquisition module 102.
An acquisition module 101 configured to acquire an image to be recognized.
A visual attribute information obtaining module 102, configured to input the image to be recognized into a target attribute information extraction model and obtain at least one piece of visual attribute information through the target attribute information extraction model, where the target attribute information extraction model includes a first feature extraction network, a second feature extraction network and an output network, the first feature extraction network uses depthwise separable convolution to extract features from the image to be recognized to obtain three feature maps, the second feature extraction network extracts features again from each of the three feature maps to obtain three target feature maps, and the output network outputs the three target feature maps.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working process of the apparatus described above may refer to the corresponding process in the foregoing method, and will not be described in too much detail herein.
Some embodiments of the application provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, may implement the method as described in any of the embodiments included in the method of obtaining visual property information as described above.
As shown in fig. 5, some embodiments of the present application provide an electronic device 500, which includes a memory 510, a processor 520, and a computer program stored on the memory 510 and executable on the processor 520, wherein the processor 520 reads the program from the memory 510 through a bus 530 and executes the program, so as to implement the method according to any of the embodiments included in the method for obtaining visual attribute information.
Processor 520 may process digital signals and may include various computing structures, such as a complex instruction set computer (CISC) architecture, a reduced instruction set computer (RISC) architecture, or an architecture implementing a combination of instruction sets. In some examples, processor 520 may be a microprocessor.
Memory 510 may be used to store instructions that are executed by processor 520 or data related to the execution of the instructions. The instructions and/or data may include code for performing some or all of the functions of one or more of the modules described in embodiments of the application. The processor 520 of the disclosed embodiments may be used to execute instructions in the memory 510 to implement the method shown in fig. 1. Memory 510 includes dynamic random access memory, static random access memory, flash memory, optical memory, or other memory known to those skilled in the art.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist alone, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in a process, method, article, or apparatus that comprises the element.

Claims (15)

1. A method of obtaining visual attribute information, the method comprising:
acquiring an image to be identified;
inputting the image to be recognized into a target attribute information extraction model, and obtaining at least one piece of visual attribute information through the target attribute information extraction model, wherein the target attribute information extraction model comprises: a first feature extraction network, a second feature extraction network and an output network, wherein the first feature extraction network uses depthwise separable convolution to perform feature extraction on the image to be recognized to obtain three feature maps, the second feature extraction network performs feature extraction on the three feature maps again to obtain three target feature maps, and the output network is used for outputting the three target feature maps.
2. The method of claim 1, wherein the first feature extraction network comprises a MobileNetV3 network, wherein the MobileNetV3 network employs the depthwise separable convolution for feature extraction of the image to be recognized.
3. The method of claim 2, wherein the first feature extraction network further comprises a first pooling processing module configured to derive fixed-size feature vectors and a second pooling processing module configured to extract different sizes of spatial feature information.
4. The method of claim 3,
the obtaining of at least one visual attribute information through the target attribute information extraction model includes:
inputting the image to be identified into the first pooling processing module to obtain a feature vector with a fixed size;
obtaining an initial feature map according to the feature vector and the MobileNetV3 network;
and obtaining a target feature map according to the initial feature map and the second pooling processing module, wherein the target feature map belongs to one of the three feature maps.
5. The method of claim 4, wherein the first and second pooling processing modules each comprise: a first pooling sub-module, a second pooling sub-module, a third pooling sub-module and a splicing sub-module, wherein the first pooling sub-module pools the input image with a first-size pooling kernel to obtain a first-size pooled map, the second pooling sub-module pools the input image with a second-size pooling kernel to obtain a second-size pooled map, the third pooling sub-module pools the input image with a third-size pooling kernel to obtain a third-size pooled map, and the splicing sub-module is configured to fuse the input image, the first-size pooled map, the second-size pooled map and the third-size pooled map on one channel using a splicing function to obtain an output image of the first pooling processing module or the second pooling processing module.
6. The method of claim 5, wherein the first-size pooling kernel is a 5 × 5 kernel, the second-size pooling kernel is a 9 × 9 kernel, and the third-size pooling kernel is a 13 × 13 kernel.
7. The method of any of claims 3-6, wherein the first feature extraction network further comprises a CBAM attention mechanism processing module, wherein the CBAM attention mechanism processing module receives image data output by the MobileNetV3 network, the output data of the CBAM attention mechanism processing module being input data to the second feature extraction network.
8. The method of claim 7, wherein the CBAM attention mechanism processing module is configured to: for the input feature map, take the average value and the maximum value over the channels at each feature point to obtain two results, stack the two results, change the number of channels through a pointwise convolution, obtain the weight of each feature point of the input feature layer through a Sigmoid function, and multiply the weight with the input feature map to complete the spatial attention operation.
9. The method of claim 7, wherein the MobileNetV3 network comprises a first MobileNetV3 network, a second MobileNetV3 network, and a third MobileNetV3 network, and the CBAM attention mechanism processing module comprises a first CBAM attention mechanism processing module, a second CBAM attention mechanism processing module, and a third CBAM attention mechanism processing module, wherein,
the output of the first MobileNetV3 network is used as the input of the first CBAM attention mechanism processing module;
the output of the second MobileNet V3 network is used as the input of the second CBAM attention mechanism processing module;
the output of the third MobileNetV3 network serves as the input of the second pooling processing module, and the output of the second pooling processing module serves as the input of the third CBAM attention mechanism processing module.
10. The method of claim 9, wherein the second feature extraction network uses a PANet network.
11. The method of claim 1, wherein prior to said inputting the image to be recognized into the target attribute information extraction model, the method further comprises:
clustering targets in the data set by using a K-means algorithm to obtain prior boxes of a plurality of target sizes;
and training an attribute information extraction model according to the prior boxes to obtain the target attribute information extraction model.
12. The method of claim 11, wherein the K-means algorithm is a kmeans++ algorithm.
13. An apparatus for obtaining visual attribute information, the apparatus comprising:
an acquisition module configured to acquire an image to be recognized;
a visual attribute information obtaining module, configured to input the image to be recognized into a target attribute information extraction model, and obtain at least one piece of visual attribute information through the target attribute information extraction model, wherein the target attribute information extraction model comprises: a first feature extraction network, a second feature extraction network and an output network, wherein the first feature extraction network uses depthwise separable convolution to perform feature extraction on the image to be recognized to obtain three feature maps, the second feature extraction network performs feature extraction on the three feature maps again respectively to obtain three target feature maps, and the output network is used for outputting the three target feature maps.
14. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, is adapted to carry out the method of any one of claims 1 to 12.
15. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program is adapted to implement the method of any of claims 1-12.
CN202210963972.5A 2022-08-11 2022-08-11 Method, device, medium and electronic equipment for acquiring visual attribute information Pending CN115311518A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210963972.5A CN115311518A (en) 2022-08-11 2022-08-11 Method, device, medium and electronic equipment for acquiring visual attribute information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210963972.5A CN115311518A (en) 2022-08-11 2022-08-11 Method, device, medium and electronic equipment for acquiring visual attribute information

Publications (1)

Publication Number Publication Date
CN115311518A true 2022-11-08

Family

ID=83861813

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210963972.5A Pending CN115311518A (en) 2022-08-11 2022-08-11 Method, device, medium and electronic equipment for acquiring visual attribute information

Country Status (1)

Country Link
CN (1) CN115311518A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116189115A (en) * 2023-04-24 2023-05-30 青岛创新奇智科技集团股份有限公司 Vehicle type recognition method, electronic device and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 19 / F, building B, Xingzhi science and Technology Park, 6 Xingzhi Road, Nanjing Economic and Technological Development Zone, Jiangsu Province, 210000

Applicant after: AINNOVATION (NANJING) TECHNOLOGY Co.,Ltd.

Address before: Floor 19, building B, Xingzhi science and Technology Park, 6 Xingzhi Road, Jiangning Economic and Technological Development Zone, Nanjing, Jiangsu Province

Applicant before: AINNOVATION (NANJING) TECHNOLOGY Co.,Ltd.