CN111860250A - Image identification method and device based on character fine-grained features - Google Patents
- Publication number
- CN111860250A CN111860250A CN202010655258.0A CN202010655258A CN111860250A CN 111860250 A CN111860250 A CN 111860250A CN 202010655258 A CN202010655258 A CN 202010655258A CN 111860250 A CN111860250 A CN 111860250A
- Authority
- CN
- China
- Prior art keywords
- image
- sample
- layer
- feature
- preset
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/10—Terrestrial scenes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The invention relates to the technical field of image processing, and discloses an image identification method and device based on fine-grained features of persons. The method comprises the following steps: acquiring a person image to be identified; performing feature extraction on the person image to be identified to obtain a person feature layer; inputting the person feature layer into a preset supercolumn feature recognition model to obtain a corresponding image recognition result; obtaining an image recognition accuracy according to the image recognition result; and when the image recognition accuracy is greater than or equal to a preset standard threshold, taking the image recognition result as the image recognition result based on person fine-grained features. Compared with the prior art, in which an attention mechanism network used for image processing cannot accurately acquire key region information and therefore cannot accurately identify the image category, the method locates key regions of the image accurately and efficiently.
Description
Technical Field
The invention relates to the technical field of image processing, in particular to an image identification method and device based on fine-grained characteristics of people.
Background
In daily life, users need to identify the persons in captured images. In the prior art, image recognition pipelines extract a large number of category-level semantic features, which makes them suitable only for coarse-grained image classification tasks, while a large amount of low-level spatial information of the image, such as position, texture and contour, is lost. As a result, attention mechanism networks used for fine-grained image feature localization cannot efficiently and accurately acquire key region information and cannot accurately identify person images. How to efficiently and accurately acquire the key region information of an image so as to accurately identify person images is therefore a technical problem to be solved urgently.
The above is only for the purpose of assisting understanding of the technical aspects of the present invention, and does not represent an admission that the above is prior art.
Disclosure of Invention
The invention mainly aims to provide an image identification method and device based on character fine-grained characteristics, and aims to solve the technical problem of how to efficiently and accurately acquire information of a key area of an image so as to accurately identify a character image.
In order to achieve the above object, the present invention provides an image recognition method based on fine-grained features of a person, including the following steps:
Acquiring a person image to be identified;
performing feature extraction on the person image to be identified to obtain a person feature layer;
inputting the person feature layer into a preset supercolumn feature recognition model to obtain a corresponding image recognition result;
obtaining an image recognition accuracy according to the image recognition result;
and when the image recognition accuracy is greater than or equal to a preset standard threshold, taking the image recognition result as the image recognition result based on person fine-grained features.
Preferably, before the step of acquiring the image to be recognized of the person, the method further includes:
acquiring image training sets corresponding to different characters, and traversing the image training sets to obtain traversed current training images;
obtaining a corresponding sample convolution layer according to the current training image;
extracting a sample characteristic layer from the sample convolution layer;
obtaining a layer pixel point set corresponding to the sample characteristic layer;
superposing the layer pixel point set by a preset up-sampling method to obtain a sample super-column set;
when the traversal is finished, constructing a sample super-column set according to all the obtained sample super-column sets;
respectively preprocessing each sample super-column set in the sample super-column set to obtain a sample target image set;
Obtaining a sample person recognition result corresponding to each sample target image contained in the sample target image set;
and constructing a preset supercolumn feature recognition model according to the training image set and the sample person recognition result.
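A minimal sketch of this model-construction pipeline is given below, assuming a PyTorch-style workflow; the helper names (extract_sample_feature_layers, build_hypercolumn_set and so on), the choice of intermediate layers, and the use of torch are illustrative assumptions rather than the patent's actual implementation.

```python
# Hypothetical sketch of the training-data preparation described above.
from typing import Dict, List, Tuple
import torch
import torch.nn.functional as F

def extract_sample_feature_layers(image: torch.Tensor, backbone_stages) -> List[torch.Tensor]:
    """Run the backbone stage by stage and keep every intermediate feature map."""
    feats, x = [], image
    for stage in backbone_stages:          # "sample convolution layers"
        x = stage(x)
        feats.append(x)
    return feats

def build_hypercolumn_set(feature_layers: List[torch.Tensor], size=(224, 224)) -> torch.Tensor:
    """Upsample selected feature maps to the input size and stack their channels."""
    up = [F.interpolate(f, size=size, mode="bilinear", align_corners=False) for f in feature_layers]
    return torch.cat(up, dim=1)            # per-pixel hypercolumns ("sample super-column set")

def prepare_training_samples(training_sets: Dict[str, List[torch.Tensor]],
                             backbone_stages, preprocess, recognize) -> List[Tuple[torch.Tensor, str]]:
    samples = []
    for person, images in training_sets.items():        # traverse the image training sets
        for img in images:                               # traversed current training image, (1, 3, 224, 224)
            feats = extract_sample_feature_layers(img, backbone_stages)
            hc = build_hypercolumn_set(feats[2:5])       # assumed choice of sample feature layers
            target_img = preprocess(hc, img)             # down-sample, locate, crop and zoom
            samples.append((target_img, recognize(target_img, person)))
    return samples                                       # used to fit the supercolumn recognition model
```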
Preferably, the step of preprocessing each sample supercolumn set in the sample supercolumn set to obtain a sample target image set includes:
traversing the sample super-column set to obtain a traversed current sample super-column set;
preprocessing the current sample super-column set by a preset down-sampling method to obtain a target super-column set;
flattening the target super-column set to obtain a target area;
determining attention area positioning parameters according to the target area;
when the traversal is finished, constructing an attention area positioning parameter set according to all acquired attention area positioning parameters;
and respectively processing each sample target image contained in the sample target image set according to each attention area positioning parameter in the attention area positioning parameter set to obtain a sample target image set.
Preferably, the step of respectively processing each sample target image included in the sample target image set according to each attention area positioning parameter in the attention area positioning parameter set to obtain a sample target image set includes:
Traversing the attention area positioning parameter set to obtain a traversed current attention area positioning parameter;
determining the position of a target area according to the current attention area positioning parameter;
performing area cutting on the current training image according to the position of the target area to obtain a target area image;
amplifying the target area image by a preset bilinear interpolation method to obtain a sample target image;
and at the end of the traversal, constructing a sample target image set according to all the obtained sample target images.
Preferably, after the step of obtaining the sample person recognition result corresponding to each sample target image included in the sample target image set, the method further includes:
inputting the sample feature layer into a preset residual error model to obtain a sample high-dimensional feature layer;
determining a sample class probability loss value according to the sample high-dimensional feature layer;
judging whether the sample class probability loss value is larger than a preset probability threshold value or not;
and when the sample class probability loss value is larger than the preset probability threshold value, executing the step of constructing a preset supercolumn feature recognition model according to the training image set and the sample person recognition result.
Preferably, after the step of determining whether the sample class probability loss value is greater than a preset probability threshold, the method further includes:
and returning to the step of extracting the sample feature layer from the sample convolution layer when the sample class probability loss value is less than or equal to the preset probability threshold.
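The loss-threshold check described in the two preceding paragraphs can be sketched as follows; treating the "preset residual model" as a ResNet-style head and using cross-entropy as the class probability loss are assumptions made only for illustration.

```python
# Hypothetical sketch of the class-probability-loss check that decides whether to
# build the model or go back to feature extraction (per the text: loss > threshold -> build).
import torch
import torch.nn.functional as F

def loss_threshold_check(sample_feature_layer: torch.Tensor, label: torch.Tensor,
                         residual_model, classifier, prob_threshold: float = 0.5) -> str:
    high_dim = residual_model(sample_feature_layer)     # sample high-dimensional feature layer
    logits = classifier(high_dim.flatten(1))
    loss = F.cross_entropy(logits, label)               # sample class probability loss value
    if loss.item() > prob_threshold:
        return "construct_supercolumn_model"            # proceed to model construction
    return "re_extract_sample_feature_layer"            # return to the extraction step
```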
Preferably, the step of performing feature extraction on the person image to be identified to obtain a person feature layer includes:
inputting the person image to be identified into a preset convolutional neural network model to obtain an initial feature layer;
pooling the initial feature layer to obtain an attention image;
and obtaining the person feature layer according to the attention image and the initial feature layer.
In addition, in order to achieve the above object, the present invention further provides an image recognition apparatus based on fine-grained features of a person, including:
the acquisition module is used for acquiring a person image to be identified;
the extraction module is used for performing feature extraction on the person image to be identified to obtain a person feature layer;
the recognition module is used for inputting the person feature layer into a preset supercolumn feature recognition model to obtain a corresponding image recognition result;
the acquisition module is further used for obtaining an image recognition accuracy according to the image recognition result;
and the judging module is used for taking the image recognition result as the image recognition result based on person fine-grained features when the image recognition accuracy is greater than or equal to a preset standard threshold.
In addition, in order to achieve the above object, the present invention further provides an image recognition apparatus based on fine-grained features of a person, including: the image recognition method comprises the steps of a memory, a processor and an image recognition program based on human fine-grained features, wherein the image recognition program based on human fine-grained features is stored on the memory and can run on the processor, and when being executed by the processor, the image recognition program based on human fine-grained features realizes the steps of the image recognition method based on human fine-grained features.
Furthermore, in order to achieve the above object, the present invention further provides a storage medium having stored thereon an image recognition program based on fine-grained features of persons, which when executed by a processor implements the steps of the image recognition method based on fine-grained features of persons as described above.
In the method, a person image to be recognized is first acquired and feature extraction is performed on it to obtain a person feature layer. The person feature layer is then input into a preset supercolumn feature recognition model, which can accurately locate the key region of the image and thus quickly and accurately obtain the image recognition result corresponding to the person image. Finally, an image recognition accuracy is obtained according to the image recognition result, and when the accuracy is greater than or equal to a preset standard threshold, the result is taken as the image recognition result based on person fine-grained features. In this way, person image recognition efficiency is improved while the recognition result remains accurate.
Drawings
Fig. 1 is a schematic structural diagram of an image recognition device based on human fine-grained features in a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of a first embodiment of an image recognition method based on fine-grained features of a person according to the present invention;
FIG. 3 is a flowchart illustrating a second embodiment of an image recognition method based on fine-grained features of a person according to the present invention;
fig. 4 is a block diagram of a first embodiment of an image recognition apparatus based on fine-grained features of a person according to the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1, fig. 1 is a schematic structural diagram of an image recognition device based on human fine-grained features in a hardware operating environment according to an embodiment of the present invention.
As shown in fig. 1, the image recognition apparatus based on fine-grained features of a person may include: a processor 1001, such as a Central Processing Unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. Wherein a communication bus 1002 is used to enable connective communication between these components. The user interface 1003 may include a Display screen (Display), and the optional user interface 1003 may further include a standard wired interface and a wireless interface, and the wired interface for the user interface 1003 may be a USB interface in the present invention. The network interface 1004 may optionally include a standard wired interface, a WIreless interface (e.g., a WIreless-FIdelity (WI-FI) interface). The Memory 1005 may be a Random Access Memory (RAM) Memory or a Non-volatile Memory (NVM), such as a disk Memory. The memory 1005 may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the configuration shown in fig. 1 does not constitute a limitation of image recognition apparatuses based on fine-grained features of persons, and may include more or fewer components than those shown, or some components in combination, or a different arrangement of components.
As shown in fig. 1, a memory 1005, identified as a computer storage medium, may include an operating system, a network communication module, a user interface module, and an image recognition program based on fine-grained features of a person.
In the image recognition device based on the fine-grained character features shown in fig. 1, the network interface 1004 is mainly used for connecting with a background server and performing data communication with the background server; the user interface 1003 is mainly used for connecting user equipment; the image recognition device based on the fine-grained character of the person calls an image recognition program based on the fine-grained character of the person stored in the memory 1005 through the processor 1001 and executes the image recognition method based on the fine-grained character of the person provided by the embodiment of the invention.
Based on the hardware structure, the embodiment of the image identification method based on the character fine-grained characteristics is provided.
Referring to fig. 2, fig. 2 is a schematic flowchart of a first embodiment of an image recognition method based on fine-grained features of a person according to the present invention, and the first embodiment of the image recognition method based on fine-grained features of a person according to the present invention is provided.
In a first embodiment, the image recognition method based on fine-grained features of a person comprises the following steps:
Step S10: acquiring the person image to be identified.
It should be noted that the execution subject of the present embodiment is an image recognition device based on fine-grained features of a person, where the device is an image recognition device based on fine-grained features of a person with functions of image processing, data communication, program execution, and the like, and may also be other devices, which is not limited in this embodiment.
Before the step of obtaining the image of the person to be identified, a preset supercolumn feature identification model (HCA-CNN) needs to be established, wherein the preset supercolumn feature identification model is a Convolutional Neural Network model established based on the supercolumn feature idea of image segmentation and fine-grained positioning.
Image training sets corresponding to different persons are acquired and traversed to obtain the traversed current training image. A corresponding sample convolution layer is obtained according to the current training image, and a sample feature layer is extracted from the sample convolution layer. A layer pixel point set corresponding to the sample feature layer is obtained, and the layer pixel point sets are superposed by a preset up-sampling method to obtain a sample super-column set. When the traversal is finished, a sample super-column set is constructed according to all the obtained sample super-column sets. Each sample super-column set in the sample super-column set is preprocessed to obtain a sample target image set, the sample person recognition result corresponding to each sample target image contained in the sample target image set is obtained, and a preset supercolumn feature recognition model is constructed according to the training image set and the sample person recognition result.
In the above, extracting the sample feature layer from the sample convolution layer may be done by inputting the current training image into a convolutional neural network to obtain the corresponding sample convolution layer, or by inputting the sample feature layer into a preset residual model (a deep residual network model) to obtain a sample high-dimensional feature layer. In the feature extraction stage, the magnitude of the network parameters is reduced through convolutional feature discrimination, weight sharing and pooling, and the features are finally input into a conventional neural network structure to complete the classification task.
The step of preprocessing each sample super-column set in the sample super-column set to obtain a sample target image set may be understood as follows. The sample super-column set is traversed to obtain the traversed current sample super-column set, and the current sample super-column set is preprocessed by a preset down-sampling method to obtain a target super-column set. The target super-column set is flattened to obtain a target area, and an attention area positioning parameter is determined according to the target area. When the traversal is finished, an attention area positioning parameter set is constructed according to all the acquired attention area positioning parameters, and each sample target image contained in the sample target image set is processed according to each attention area positioning parameter in the attention area positioning parameter set to obtain the sample target image set.
The step of processing each sample target image included in the sample target image set according to each attention area positioning parameter in the attention area positioning parameter set to obtain a sample target image set may be understood as follows. The attention area positioning parameter set is traversed to obtain the traversed current attention area positioning parameter, and the target area position is determined according to the current attention area positioning parameter. The current training image is then cropped according to the target area position to obtain a target area image, and the target area image is amplified by a preset bilinear interpolation method to obtain a sample target image. When the traversal is finished, the sample target image set is constructed according to all the obtained sample target images.
For ease of understanding, the following specific steps for constructing the HCA-CNN network model may be:
the character feature data set may be a movie and television character image set, a character image set in daily life, or the like, and this embodiment is not limited thereto.
The following is exemplified by the Beijing opera character feature data set:
according to the different visual characteristics of Beijing Opera characters, a Beijing Opera Role (BJOR) data set facing to the Beijing Opera character recognition task is manufactured, more than 300 Beijing Opera videos of classical Opera are sorted and classified, different video frames are set by a control variable method for image capture, and 273100 pictures are obtained in total; 40000 single target pictures for the image classification task are obtained through screening; the classification is 8, 5000 pieces in each category.
Further, the data set is input to the HCA-CNN for image recognition.
The HCA-CNN network is formed by iterating three scale sub-networks, each with the same structure. The input picture first passes through a lightweight network (the MobileNetV2 classification network), and a feature map of each feature layer is obtained after a series of feature extraction operations. On the one hand, the last feature map layer is input into a classifier as the classification task of the current scale; on the other hand, selected stage feature maps are superposed to form a HyperColumn Set, which is input into a super-column-based attention mechanism network (HC-APN). The HC-APN network performs down-sampling and fully connected operations on the obtained HyperColumn Set features, and then the key region image is amplified according to the extracted key region parameters. The amplified image is used as the input of the next scale, and the iteration is repeated in this way, so that the proportion of the key region is increased and fine classification of fine-grained images is finally realized.
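As a rough sketch (and only a sketch: the module classes, the selected stages and whether weights are shared across scales are all assumptions not stated here), the three-scale iteration can be organized as follows:

```python
# Hypothetical sketch of the three-scale HCA-CNN forward pass.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleNet(nn.Module):
    """One scale: backbone stages -> classifier, plus HC-APN for region localization."""
    def __init__(self, backbone_stages: nn.ModuleList, classifier: nn.Module, hc_apn: nn.Module):
        super().__init__()
        self.stages, self.classifier, self.hc_apn = backbone_stages, classifier, hc_apn

    def forward(self, x):
        feats, h = [], x
        for stage in self.stages:
            h = stage(h)
            feats.append(h)
        logits = self.classifier(feats[-1])                      # classification task of this scale
        selected = feats[2:5]                                    # assumed choice of stage feature maps
        hc = torch.cat([F.interpolate(f, size=x.shape[-2:], mode="bilinear", align_corners=False)
                        for f in selected], dim=1)               # HyperColumn Set
        return logits, self.hc_apn(hc)                           # (tx, ty, tl) region parameters

def hca_cnn_forward(scales, x, crop_and_zoom):
    """Each scale classifies its input and proposes the amplified key region fed to the next scale."""
    all_logits = []
    for scale in scales:                                         # typically three ScaleNet instances
        logits, params = scale(x)
        all_logits.append(logits)
        x = crop_and_zoom(x, *params)                            # amplified key region as next input
    return all_logits
```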
The Recurrent Attention Convolutional Neural Network (RA-CNN) showed that spatial features contribute to the identification of fine-grained images, so the large number of spatial features present in the feature extraction stage of Beijing opera images is worth intensive study. In the feature extraction stage, the magnitude of the network parameters is reduced through convolutional feature discrimination, weight sharing and pooling, and the features are finally input into a conventional neural network structure to complete the classification task. The image characteristics of Beijing opera characters can then be shown through the middle-layer feature map information extracted for visual characteristics. Taking the MobileNetV2 network as an example, the feature map strength at different feature extraction stages can be shown in a red-blue map, where red represents stronger features and blue represents weaker features. It can be observed that the closer to the input layer (ImageInput), the weaker the low-level category information (category characteristics) and the stronger the spatial characteristics; the closer to the output layer (classifier), the stronger the high-level category information and the weaker the spatial features.
Because the sub-network APN of RA-CNN only adopts the last feature map layer of the backbone depth model (VGG) as its input feature, it does not need much processing of spatial features. Based on this change of input features, the present method makes corresponding improvements on the basis of the sub-network APN and proposes a new attention mechanism sub-network, HC-APN.
The HC-APN sub-network first down-samples the input features of size 224 x 2024 to size 7 x 2024, and then performs two fully connected operations: the first flattens the features to a size of 1 x 16192, and the second reduces them to a size of 1 x 3 (the 3 channels represent the three parameters tx, ty, tl for attention area localization).
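A minimal sketch of such a sub-network is shown below; the sizes quoted above look partly garbled in this text, so the sketch simply assumes spatial down-sampling to 7 x 7 followed by two fully connected layers ending in the three localization parameters, with the hidden width chosen arbitrarily.

```python
# Hypothetical sketch of the HC-APN sub-network (sizes are illustrative assumptions).
import torch
import torch.nn as nn

class HCAPN(nn.Module):
    def __init__(self, in_channels: int, hidden: int = 1024):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(7)                  # down-sample hypercolumn features to 7x7
        self.fc1 = nn.Linear(in_channels * 7 * 7, hidden)    # first flatten / fully connected step
        self.fc2 = nn.Linear(hidden, 3)                      # second step -> (tx, ty, tl)

    def forward(self, hypercolumns: torch.Tensor):
        x = self.pool(hypercolumns).flatten(1)
        x = torch.relu(self.fc1(x))
        tx, ty, tl = self.fc2(x).unbind(dim=1)               # attention-region localization parameters
        return tx, ty, tl
```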
Furthermore, the attention network is used to determine the position of the target region. After the coordinate relation is defined, the cropped attention region is obtained by element-wise multiplication of the Mask M function and the input image X. The determined target region is then amplified by a bilinear interpolation method (the formulas are given in the second embodiment below). Through these steps the corresponding image recognition result is obtained from the region-amplified image, and the preset supercolumn feature recognition model is then constructed according to the training image set and the sample person recognition result.
And performing joint loss function formula calculation on the successfully constructed preset supercolumn feature recognition model to verify whether the preset supercolumn feature recognition model meets the requirements, and when the preset supercolumn feature recognition model does not meet the requirements, adjusting parameters in the preset supercolumn feature recognition model to obtain the preset supercolumn feature recognition model with higher accuracy.
Step S20: performing feature extraction on the person image to be identified to obtain a person feature layer.
Extracting the person feature layer from the person image to be identified may be done by inputting the person image to be identified into a preset convolutional neural network model to obtain an initial feature layer, pooling the initial feature layer to obtain an attention image, and obtaining the person feature layer according to the attention image and the initial feature layer; alternatively, the feature layer may be input into a preset residual model (a deep residual network model) to obtain a high-dimensional feature layer. This embodiment is not limited in this respect.
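The patent does not specify how the attention image and the initial feature layer are combined, so the sketch below treats an element-wise weighting as one plausible reading, and uses mobilenet_v2 merely as a stand-in for the "preset convolutional neural network model".

```python
# Hypothetical sketch of Step S20: backbone features + pooled attention image -> person feature layer.
import torch
import torch.nn.functional as F
from torchvision.models import mobilenet_v2

backbone = mobilenet_v2().features          # stand-in for the preset convolutional neural network model

def person_feature_layer(image: torch.Tensor) -> torch.Tensor:
    """image: (1, 3, 224, 224) tensor of the person image to be identified."""
    feats = backbone(image)                                    # initial feature layer
    attn = torch.sigmoid(F.adaptive_avg_pool2d(feats, 1))      # pooling -> attention image (assumed form)
    return feats * attn                                        # combine attention image with initial features
```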
Step S30: inputting the person feature layer into a preset supercolumn feature recognition model to obtain a corresponding image recognition result.
In the structural schematic diagram of the lightweight classification model, the extracted feature map (namely, a character feature map layer) is subjected to multi-Task learning, Task1 represents a super column Set (HyperColumn Set) for learning HC-APN, and Task2 represents a feature map for learning classification tasks.
In Task1, the feature maps differ in size from the original image, so before the feature maps of the individual layers can be superimposed with the original image pixel by pixel, upsampling is required first.
After each layer's feature map has been upsampled, it gives the features of that layer at every pixel position, and the accumulation over the k feature maps means that the channels of the different feature maps are stacked together rather than added numerically. The value fi is not the whole feature map but the hypercolumn at a single pixel position i; since the input image size is 224 x 224, the HyperColumn Set consists of the 224 x 224 hypercolumns fi, that is, superposition can be performed after each layer's feature map has been upsampled.
Because the sub-network APN of RA-CNN only adopts the last feature map layer of the backbone network VGG as its input feature, it does not need much processing of spatial features. Based on this change of input features, the present method makes corresponding improvements on the basis of the sub-network APN and proposes a new attention mechanism sub-network, HC-APN.
The HC-APN sub-network first down-samples the input features of size 224 x 2024 to size 7 x 2024, and then performs two fully connected operations: the first flattens the features to a size of 1 x 16192, and the second reduces them to a size of 1 x 3 (the 3 channels represent the three parameters tx, ty, tl for attention area localization).
Further, the attention network is used for determining the position of the target area, and after a coordinate relation is defined, the clipped attention area is obtained by a method of element multiplication of a Mask m function and the input image X.
The Mask m function can select the most important region in forward propagation, then the determined target region is subjected to region amplification by a bilinear interpolation method, and finally image recognition is carried out according to the amplified image so as to obtain a corresponding image recognition result.
Step S40: obtaining the image recognition accuracy according to the image recognition result.
Step S50: when the image recognition accuracy is greater than or equal to a preset standard threshold, taking the image recognition result as the image recognition result based on person fine-grained features.
It can be understood that the preset supercolumn feature recognition model can also output an image recognition accuracy corresponding to the image recognition result; the accuracy may be, for example, 50%, 60% or 90%.
Assume the image recognition accuracy corresponding to the current image recognition result is 80% and the preset standard threshold is 70% (the threshold is user-defined and not limited by this embodiment). Since 80% is greater than 70%, the image recognition result is taken as the image recognition result based on person fine-grained features.
In this embodiment, a person image to be recognized is acquired and feature extraction is performed on it to obtain a person feature layer. The person feature layer is input into the preset supercolumn feature recognition model, which can accurately locate the key region of the image and thus quickly and accurately obtain the image recognition result corresponding to the person image. An image recognition accuracy is then obtained according to the image recognition result, and when the accuracy is greater than or equal to the preset standard threshold, the result is taken as the image recognition result based on person fine-grained features, so that person image recognition efficiency is improved while the recognition result remains accurate.
In addition, referring to fig. 3, fig. 3 is a first embodiment of the image recognition method based on the fine-grained feature of the person, and a second embodiment of the image recognition method based on the fine-grained feature of the person is proposed.
In the second embodiment, before the step S10 in the image recognition method based on fine-grained features of a person, the method further includes:
step S001: and acquiring image training sets corresponding to different characters, traversing the image training sets, and acquiring traversed current training images.
The character feature data set may be a movie and television character image set, a character image set in daily life, or the like, and this embodiment is not limited thereto.
The following uses a person feature data set from daily life and a Beijing opera character feature data set as examples:
1. data set classification
The person feature set from daily life can be classified according to characteristics of the persons such as age, gender and occupation. Category labels can be set, including: "Zhongnian_Nanxing_Bailing", "Qingnian_Nanxing_Junren", "Qingnian_Nanxing_Xuesheng", "Zhongnian_Nvxing_Getihu", "Qingnian_Nvxing_Xuesheng", "Laonian_Nvxing_Zhufu".
The head (Headwear), the Face (Face), the Beard (Beard), the clothing (Clothes) and other parts are adopted to distinguish the characteristics of the Type.
Part of characteristics are selected from the following characteristics for introduction:
(1) Middle-aged male white collar (Zhongnian_Nanxing_Bailing): the head is characterized by no cap and a neat hairstyle; the clothes are mostly a black business suit, a white shirt and a tie;
(2) young male soldiers (Qingnian _ Nanxing _ Junren): the head is characterized in that the clothes are mostly characterized by wearing army caps and flat-head hairstyles; the color of the clothes is characterized by army green;
(3) Young male students (Qingnian _ Nanxing _ Xuesheng): the head is characterized by a flat head; most clothes are dark blue and white school uniforms;
(4) middle-aged female teacher (Zhongnian _ Nvxing _ Getihu): the face features are mostly glasses; the clothes are mostly characterized by carrying textbooks;
(5) young female students (Qingnian _ Nvxing _ Xuesheng): the head is characterized by long hair or student hair style; most clothes are dark blue and white school uniforms;
(6) housewives of elderly women (Laonian _ Nvxing _ Zhufu): the head is mostly grey-white; the dress features are mostly a wearing apron.
A daily-life person data set oriented to the person recognition task is created according to the differences in visual characteristics of persons in daily life; 1,200 photographed images are acquired and sorted into 6 categories of 200 images each. The collected images are collated to obtain the corresponding image training set.
According to the different visual characteristics of Beijing opera characters, a Beijing Opera Role (BJOR) data set oriented to the Beijing opera character recognition task is produced. More than 300 Beijing opera videos of classical plays are sorted and classified, image capture is performed at different video-frame intervals set by a controlled-variable method, and 273,100 pictures are obtained in total; 40,000 single-target pictures for the image classification task are obtained through screening and divided into 8 categories of 5,000 pictures each. The collected images are collated to obtain the corresponding image training set.
The Beijing opera characters are classified according to characteristics such as age, gender and personality. Eight representative role types (hangdang) are selected as category labels; the basic category labels include: "LaoSheng", "WuSheng", "XiaoSheng", "ZhengDan", "HuaDan", "LaoDan", "JingJue", "ChouJue".
By consulting related material such as Beijing opera costume atlases, parts such as the headdress (Headwear), facial makeup (Face), beard (Beard), costume (Clothes), sleeve (Sleeve) and belt (Belt) are adopted to distinguish the features of each role type (hangdang).
Some of the features are selected as examples:
Laosheng (LaoSheng): the beard is varied, in black, grey or white, and in three-strand or full shape; the facial makeup is light overall;
Xiaosheng (XiaoSheng): there is no beard, so the mouth shape can be observed; the facial makeup is heavy, with deep red lips;
Wusheng (WuSheng): there is no beard and the mouth shape can be observed; the facial makeup shows deeper red lips; the costume is often the white long kao (armour);
Zhengdan (ZhengDan): the facial makeup is heavy; the headdress is characterized by silver beads; the sleeves have water sleeves;
Huadan (HuaDan): the costume is mostly a fandan (apron) and a padded skirt; the headwear is mostly bright head-face ornaments and rhinestones; the sleeves are without water sleeves; in addition, the handkerchief is also a special identification feature;
Laodan (LaoDan): the facial makeup is light; the costume is mostly a yellow, greyish-white or dark green xuezi (a kind of informal robe); in addition, the crutch is also a special identification feature;
Jingjue (JingJue): the beard is often full; the facial makeup is heavy and includes specific categories such as the "whole face", "three-tile face" and "broken face";
Choujue (ChouJue): the facial makeup features a patch of white powder at the bridge of the nose; the beard style is also a special identification feature.
Step S002: and acquiring a corresponding sample convolution layer according to the current training image.
The method includes selecting a current training image from the current training set to extract a sample convolution layer, where the current training image may be input into a preset convolution neural network model to obtain an initial feature layer, performing pooling processing on the initial feature layer to obtain an attention image, and obtaining the sample convolution layer according to the attention image and the initial feature layer, which is not limited in this embodiment.
Step S003: and extracting a sample characteristic layer from the sample convolution layer.
A subset of sample feature layers with clear image contours is selected from the sample convolution layers, which are ordered by the clarity of the image contours. Assuming an image has three layers and the clarity of the contours gradually decreases from the lower layer through the middle layer to the upper layer, the lower layer can be selected as the sample feature layer according to user requirements.
Step S004: and acquiring a layer pixel point set corresponding to the sample characteristic layer.
Step S005: and superposing the layer pixel point set by a preset up-sampling method to obtain a sample super-column set.
In the invention, the middle-layer feature maps of the MobileNetV2 classification network are shown, where (a) represents the feature map of the bottommost layer, in which obvious contour characteristics can be seen; (b) and (c) represent feature maps of the intermediate stages, where the contour characteristics are weakened; and (d) represents a feature map of the higher layers, where features such as contours have disappeared.
In the process of feature extraction, in order to meet classification tasks, semantic information of the current class of the Beijing opera is continuously enhanced, and spatial features are weakened (including character postures, joints of limbs, stage lighting intensity, stage positions and the like).
In the process of feature extraction by the convolutional neural network, because the spatial features are gradually weakened and the category semantic features are continuously enhanced, the feature maps at different stages present large feature differences. Drawing on the idea of the gating network structure for feature fusion and the scale-dependent pooling (SDP) algorithm, the supercolumn feature used for image segmentation is applied to the task of fusing the spatial features and category features of the Beijing opera character feature maps at different stages, via the following formula:
fi = Σk aik · Fk    (1)
In the formula, i is a pixel point of the input Beijing opera image; fi is the feature vector obtained by concatenating the corresponding positions of each layer's feature map, i.e. the hypercolumn; Σk denotes the accumulation operation over the k feature maps (channel stacking, not numerical addition); and aik relates pixel position i to feature map Fk (the interpolation coefficient used when upsampling feature map k to pixel position i).
Further, the HCA-CNN network proposed for Beijing opera character image research is formed by iterating a three-layer hierarchical structure in which every layer has the same network structure, and part of the features of each layer serve as the input information of the next layer. The image input into the HCA-CNN network first passes through the MobileNetV2 classification network, and a series of feature extraction operations yields the feature maps of each intermediate layer. On the one hand, the last feature map layer is input into a classifier for the classification task of the current scale; on the other hand, the feature maps of some of the intermediate layers are superposed pixel by pixel to form the HyperColumn Set and input into the HC-APN sub-network. The HC-APN network performs down-sampling and fully connected operations on the obtained HyperColumn Set features, then amplifies the key region image according to the extracted key region parameters, and the amplified image is used as the input of the next layer.
Considering the end-to-end application scenario of the Beijing opera character recognition task and the requirement of real-time recognition, this patent proposes using the MobileNetV2 network, which has fewer parameters and higher operating efficiency, as the backbone network; MobileNetV2 is well suited to end-to-end real-time scenarios. Its composition is similar to that of VGG, stacking different conv2d and bottleneck structures.
In the structural schematic diagram of the lightweight classification model, the extracted feature map is subjected to multi-Task learning, Task1 represents a super column Set (hyperColumn Set) for learning HC-APN, and Task2 represents a feature map for learning classification tasks.
In Task1, the feature maps differ in size from the original image, so before the feature maps of the individual layers can be superimposed with the original image pixel by pixel, upsampling is required first. To obtain the value f of the function at a point P, linear interpolation is first performed in the x direction, giving the upsampling formulas:
f(R1) ≈ ((x2 − x)/(x2 − x1))·f(Q11) + ((x − x1)/(x2 − x1))·f(Q21)    (2)
f(R2) ≈ ((x2 − x)/(x2 − x1))·f(Q12) + ((x − x1)/(x2 − x1))·f(Q22)
Linear interpolation is then carried out in the y direction to obtain:
f(P) ≈ ((y2 − y)/(y2 − y1))·f(R1) + ((y − y1)/(y2 − y1))·f(R2)    (3)
where P = (x, y) denotes the point inserted by upsampling; Q11 = (x1, y1), Q12 = (x1, y2), Q21 = (x2, y1) and Q22 = (x2, y2) are the four existing pixels of the original image; and R1 = (x, y1), R2 = (x, y2) are the intermediate interpolation points.
After each layer's feature map has been upsampled, it gives the features of that layer at every pixel position, and the accumulation over the k feature maps means that the channels of the different feature maps are stacked together rather than added numerically. The value fi is not the whole feature map but the hypercolumn at a single pixel position i; since the input image size is fixed at 224 x 224, the HyperColumn Set consists of the 224 x 224 hypercolumns fi, i.e. superposition can be performed after each layer's feature map has been upsampled.
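A short sketch of this construction follows: bilinear upsampling of each selected stage feature map to the 224 x 224 input resolution, then channel stacking so that every pixel position i carries a concatenated hypercolumn fi. Which stages are selected is an assumption here.

```python
# Hypothetical sketch of HyperColumn Set construction per formula (1).
import torch
import torch.nn.functional as F

def hypercolumn_set(feature_maps, size=(224, 224)) -> torch.Tensor:
    """feature_maps: list of (N, C_k, H_k, W_k) stage feature maps."""
    upsampled = [F.interpolate(fm, size=size, mode="bilinear", align_corners=False)
                 for fm in feature_maps]
    # Channel stacking (not numerical addition): result is (N, sum_k C_k, 224, 224),
    # so each pixel position i holds its hypercolumn fi.
    return torch.cat(upsampled, dim=1)
```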
Step S006: and at the end of traversal, constructing a sample supercolumn set according to all the obtained sample supercolumn sets.
Step S007: and respectively preprocessing each sample supercolumn set in the sample supercolumn set to obtain a sample target image set.
The step of preprocessing each sample super-column set in the sample super-column set to obtain a sample target image set may be understood as follows. The sample super-column set is traversed to obtain the traversed current sample super-column set, and the current sample super-column set is preprocessed by a preset down-sampling method to obtain a target super-column set. The target super-column set is flattened to obtain a target area, and an attention area positioning parameter is determined according to the target area. When the traversal is finished, an attention area positioning parameter set is constructed according to all the acquired attention area positioning parameters, and each sample target image contained in the sample target image set is processed according to each attention area positioning parameter in the attention area positioning parameter set to obtain the sample target image set.
The step of processing each sample target image included in the sample target image set according to each attention area positioning parameter in the attention area positioning parameter set to obtain a sample target image set may be understood as follows. The attention area positioning parameter set is traversed to obtain the traversed current attention area positioning parameter, and the target area position is determined according to the current attention area positioning parameter. The current training image is then cropped according to the target area position to obtain a target area image, and the target area image is amplified by a preset bilinear interpolation method to obtain a sample target image. When the traversal is finished, the sample target image set is constructed according to all the obtained sample target images.
Because the sub-network APN of RA-CNN only adopts the last feature map layer of the backbone network VGG as its input feature, it does not need much processing of spatial features. Based on this change of input features, the present method makes corresponding improvements on the basis of the sub-network APN and proposes a new attention mechanism sub-network, HC-APN.
The HC-APN sub-network first down-samples the input features of size 224 x 2024 to size 7 x 2024, then performs two fully connected operations: the first flattens the features to a size of 1 x 16192 and the second reduces them to a size of 1 x 3 (the 3 channels represent the three parameters tx, ty, tl for attention area localization). The attention network is then used to determine the target area location by the following formula:
tx(tl) = tx − tl,  ty(tl) = ty − tl,  tx(br) = tx + tl,  ty(br) = ty + tl    (4)
In the formula, (tx, ty) is the central coordinate point of the region, tl is half of the side length of the square region, (tx(tl), ty(tl)) is the upper-left corner coordinate of the target region, and (tx(br), ty(br)) is the lower-right corner coordinate of the target region.
Further, after a coordinate relation is defined, a clipped attention area is obtained by a method of element multiplication of a Mask m function and an input image X:
Xatt = X · M(tx, ty, tl)    (5)
where X is the input image, · denotes element-wise multiplication with the Mask M function, and Xatt is the cropped attention region obtained by element-wise multiplication of the Mask M function and the input image X.
The Mask m function can select the most important region in forward propagation, and is easy to optimize in backward propagation due to the characteristic of a continuous function:
M(·) = [h(x − tx(tl)) − h(x − tx(br))] · [h(y − ty(tl)) − h(y − ty(br))]    (6)
In the Mask M function, h(x) is a step function:
h(x) = 1 / (1 + exp(−kx))
In the formula, k is a set positive integer, h(x) is the step function, and exp is the exponential function with the natural constant e as its base.
When −kx tends to positive infinity, the denominator also tends to positive infinity and h(x) tends to 0; when −kx tends to negative infinity, the second half of the denominator tends to 0, so the whole denominator tends to 1 and h(x) tends to 1. Hence h(x − tx(tl)) − h(x − tx(br)) tends to 1 only when tx(tl) ≤ x ≤ tx(br), and the same holds on the y axis. Therefore M(·) tends to 1 only when x lies between tx(tl) and tx(br) and y lies between ty(tl) and ty(br), and tends to 0 elsewhere.
Further, a bilinear interpolation method is then used to perform region amplification on the determined target region: each point (i, j) of the amplified image is obtained by bilinearly interpolating the four neighbouring pixels of the cropped attention region Xatt at the corresponding position, in the same way as the upsampling formulas (2) and (3) above. Here (i, j) is the coordinate of an interpolated point in the amplified image.
Step S008: and acquiring a sample person identification result corresponding to each sample target image contained in the sample target image set.
Step S009: and constructing a preset supercolumn feature recognition model according to the training image set and the sample character recognition result.
Through the above steps, the joint loss function is calculated for the successfully constructed preset supercolumn feature recognition model to verify whether it meets the requirements; when it does not, the parameters of the model need to be adjusted. The joint loss function is:
L(X) = Σs Lcls(Y(s), Y) + Σs Lrank(Pt(s), Pt(s+1)),  with  Lrank(Pt(s), Pt(s+1)) = Max{0, Pt(s) − Pt(s+1) + margin}
In the formula, Lcls is the category loss, i.e. the loss of the Beijing opera character category predicted by the three-layer classification network compared with the ground-truth label; Lrank is the ranking loss generated when the recognition of a coarser scale is not exceeded by the next finer scale; X is the input image; Y(s) is the predicted category probability of scale s and Y is the true category; Pt(s) is the probability of the true-label class at scale s; Pt(s) − Pt(s+1) reflects the loss generated when the category probability of the scale-s network is higher than that of scale s+1; margin is a padding value, which can be 0.05; and Max{ } yields the generated loss, which can be understood as taking the difference when it is greater than 0 and taking 0 otherwise.
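A hedged sketch of this joint loss follows; the reduction over the batch and the equal weighting of the two terms are assumptions.

```python
# Hypothetical sketch of the joint loss: per-scale classification loss plus pairwise ranking loss.
import torch
import torch.nn.functional as F

def joint_loss(logits_per_scale, target, margin: float = 0.05):
    """logits_per_scale: list of (N, num_classes) tensors from the three scales; target: (N,) labels."""
    cls_loss = sum(F.cross_entropy(logits, target) for logits in logits_per_scale)          # Lcls terms
    rank_loss = torch.zeros(())
    for s in range(len(logits_per_scale) - 1):
        p_s = F.softmax(logits_per_scale[s], dim=1).gather(1, target.view(-1, 1))            # Pt(s)
        p_next = F.softmax(logits_per_scale[s + 1], dim=1).gather(1, target.view(-1, 1))     # Pt(s+1)
        rank_loss = rank_loss + torch.clamp(p_s - p_next + margin, min=0).mean()             # Max{0, ...}
    return cls_loss + rank_loss
```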
Finally, the recognition performance of the model, namely the model accuracy, is evaluated through the Top-1 and Top-5 indexes: the class with the maximum probability is taken as the prediction result, and the prediction is correct if that class is the true class; otherwise the prediction is wrong. The evaluation formula is as follows:

Top1_accuracy = (TP + TN) / (TP + TN + FP + FN)

In the formula, Top1_accuracy is the accuracy of the maximum prediction in the predicted probability vector, TP is the number of positive samples predicted as positive, FP is the number of negative samples predicted as positive (the false alarms), FN is the number of positive samples predicted as negative (the misses), and TN is the number of negative samples predicted as negative.

Top-5 accuracy can be understood analogously: if the true class appears among the five classes with the largest predicted probabilities, the prediction is counted as correct; otherwise the prediction is wrong.
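The two indexes can be computed as in the following sketch; the function name and array layout are assumptions for illustration:

```python
import numpy as np

def topk_accuracy(probs, labels, k=5):
    """probs: (N, num_classes) predicted probability vectors; labels: (N,) true classes.
    Counts a sample as correct if the true class is among the k highest-probability classes."""
    topk = np.argsort(probs, axis=1)[:, -k:]              # indices of the k largest probabilities
    hits = np.any(topk == labels[:, None], axis=1)
    return hits.mean()

probs = np.array([[0.1, 0.6, 0.3],
                  [0.5, 0.2, 0.3]])
labels = np.array([1, 2])
print(topk_accuracy(probs, labels, k=1))   # Top-1 accuracy: 0.5
print(topk_accuracy(probs, labels, k=2))   # Top-2 accuracy: 1.0
```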
In addition, the patent also adopts a series of indexes to evaluate the complexity of the network, wherein the time complexity evaluation index comprises FLOPs, and the space complexity evaluation indexes comprise Memory Usage, Million Params and Million Mult-Adds.
The time complexity determines the training and prediction time of the model, and the space complexity determines the number of parameters and the memory access volume of the model, where the number of parameters is the total number of weights of all parameterized layers of the model. The complexity of the convolutional neural network is therefore related to the size M of the feature map output by each convolution kernel. The overall time complexity is roughly:

Time ~ O( Σl Ml^2 · Kl^2 · Cl-1 · Cl )

The overall space complexity is roughly:

Space ~ O( Σl Kl^2 · Cl-1 · Cl + Σl Ml^2 · Cl )

wherein the feature map size M = (X - K + 2·Padding)/Stride + 1, X is the input matrix size, K is the convolution kernel size, Cl-1 and Cl are the input and output channel numbers of layer l, Padding is the padding value, and Stride is the step size.
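The feature-map size formula and the per-layer cost can be illustrated with the following sketch; the helper names and the example layer configuration are assumptions, and the counts are rough per-layer figures rather than the exact FLOPs reported by profiling tools:

```python
def conv_output_size(x, k, padding, stride):
    """Feature map size M = (X - K + 2*Padding) / Stride + 1."""
    return (x - k + 2 * padding) // stride + 1

def conv_layer_cost(x, k, c_in, c_out, padding=1, stride=1):
    """Rough per-layer cost: time ~ M^2 * K^2 * C_in * C_out multiply-adds,
    space ~ K^2 * C_in * C_out weight parameters plus M^2 * C_out output activations."""
    m = conv_output_size(x, k, padding, stride)
    mult_adds = m * m * k * k * c_in * c_out
    params = k * k * c_in * c_out
    activations = m * m * c_out
    return m, mult_adds, params, activations

# Example: a 3x3 convolution on a 224x224 input with 3 -> 32 channels, stride 2
print(conv_layer_cost(224, 3, 3, 32, padding=1, stride=2))   # M = 112
```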
For the self-made BJOR data set, comparative ablation experiments are carried out by combining different weakly supervised networks with different layers (scales) of the recursive network; the selected combinations are as follows:
(1)VGG16
(2)RA-CNN(VGG16+APN)
(3)MobileNetV2
(4)MobileNetV2+APN
(5)MobileNetV2+HCAPN+HC(scale 2)
(6)MobileNetV2+HCAPN+HC(scale 3)
(7)MobileNetV2+HCAPN+HC(scale 1+2)
(8)MobileNetV2+HCAPN+HC(scale 1+2+3)
80% of the BJOR data set is selected as the network training set and the remaining 20% as the network verification set; the model accuracy is obtained from the model training results, and Table 1 gives the accuracy comparison of the ablation experiments (a minimal sketch of this 80/20 split follows the discussion of Table 1 below).
TABLE 1
As can be seen from Table 1, the MobileNetV2 network is only about 1.8% lower than VGG16; likewise, for the three-layer recursive network combined with the APN attention network, MobileNetV2 is only about 1.7% lower. The HC (hypercolumn) feature is then added, and ablation experiments are carried out on different levels (scales) of the recursive network; a certain gain of the level combinations over a single layer can be observed. The study integrates the three-scale (MobileNetV2 + HCAPN + HC) network, namely the preset supercolumn feature recognition model, whose accuracy reaches 91.58%, an improvement of 0.63% over the VGG16 + APN combination of the RA-CNN model based on the recurrent soft attention mechanism, effectively alleviating the problem that the localization of the attention mechanism is not efficient and accurate enough.
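The 80%/20% split described above can be sketched as follows; the directory layout, image size and batch size are assumptions for illustration:

```python
import torch
from torchvision import datasets, transforms

# Assumed layout: BJOR/<role_name>/<image>.jpg, one folder per Beijing opera role
dataset = datasets.ImageFolder("BJOR", transform=transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
]))

n_train = int(0.8 * len(dataset))                   # 80% for training
n_val = len(dataset) - n_train                      # 20% for verification
train_set, val_set = torch.utils.data.random_split(
    dataset, [n_train, n_val], generator=torch.Generator().manual_seed(0))

train_loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_set, batch_size=32)
```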
In this embodiment, image training sets corresponding to different characters are first obtained and traversed to obtain the traversed current training image; corresponding sample convolution layers are obtained according to the current training image, sample feature layers are extracted from the sample convolution layers, the layer pixel point sets corresponding to the sample feature layers are obtained, and the layer pixel point sets are superposed by a preset up-sampling method to obtain sample super-column sets. When the traversal is finished, a sample super-column set is constructed according to all the obtained sample super-column sets, each sample super-column set therein is preprocessed to obtain a sample target image set, the sample person recognition result corresponding to each sample target image contained in the sample target image set is obtained, and a preset supercolumn feature recognition model is constructed according to the training image sets and the sample person recognition results. Compared with the prior art, in which a large number of category semantic features are extracted so that the image processing process is complex and tedious and the key area of the image cannot be accurately located, the method locates the key area accurately with a simpler processing flow.
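The super-column construction summarized above can be sketched as follows; the choice of MobileNetV2 as the backbone follows the experiments in this document, while the selected stage indices, the common resolution of 56 x 56 and the function name are assumptions for illustration:

```python
import torch
import torch.nn.functional as F
from torchvision.models import mobilenet_v2

backbone = mobilenet_v2(weights=None).features

def hypercolumn(image, stage_ids=(3, 6, 13), out_size=56):
    """Run the backbone, collect the selected stage outputs, upsample each to
    out_size x out_size and concatenate along the channel axis, giving one
    super-column (stacked multi-layer feature vector) per pixel."""
    x = image.unsqueeze(0)                # (1, 3, H, W)
    columns = []
    for i, layer in enumerate(backbone):
        x = layer(x)
        if i in stage_ids:
            up = F.interpolate(x, size=(out_size, out_size),
                               mode="bilinear", align_corners=False)
            columns.append(up)
    return torch.cat(columns, dim=1)      # (1, sum of stage channels, out_size, out_size)

hc = hypercolumn(torch.rand(3, 224, 224))
print(hc.shape)
```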
Furthermore, an embodiment of the present invention further provides a storage medium, where an image recognition program based on fine-grained features of a person is stored, and the image recognition program based on fine-grained features of a person implements the steps of the image recognition method based on fine-grained features of a person as described above when being executed by a processor.
In addition, referring to fig. 4, an embodiment of the present invention further provides an image recognition apparatus based on a fine-grained feature of a person, where the image recognition apparatus based on a fine-grained feature of a person includes:
the acquisition module 4001 is used for acquiring a figure image to be identified;
an extraction module 4002, configured to perform feature extraction on the person image to be identified, so as to obtain a person feature map layer;
the recognition module 4003 is configured to input the character feature layer into a preset supercolumn feature recognition model to obtain a corresponding image recognition result;
the obtaining module 4001 is further configured to obtain an image recognition accuracy according to the image recognition result;
the judging module 4004 is configured to, when the image recognition accuracy is greater than or equal to a preset standard threshold, take the image recognition result as an image recognition result based on fine-grained features of a person.
In this embodiment, a character image to be recognized is obtained, feature extraction is performed on the character image to obtain a character feature layer, and the character feature layer is input into the preset supercolumn feature recognition model, which can accurately locate the key area of the image and quickly and accurately obtain the image recognition result corresponding to the character image; the image recognition accuracy is then obtained according to the image recognition result, and when the image recognition accuracy is greater than or equal to the preset standard threshold, the image recognition result is taken as the image recognition result based on character fine-grained features, thereby improving character-image recognition efficiency while keeping the recognition result accurate.
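The flow implemented by the acquisition, extraction, recognition and judging modules can be sketched as follows; the class name, the confidence-based accuracy check and the threshold value 0.9 are assumptions for illustration:

```python
import torch

class PersonFineGrainedRecognizer:
    """Minimal sketch of the device flow: extract features, run the preset
    super-column recognition model, and accept the result only if its
    confidence reaches the preset standard threshold."""

    def __init__(self, feature_extractor, hypercolumn_model, threshold=0.9):
        self.feature_extractor = feature_extractor
        self.hypercolumn_model = hypercolumn_model
        self.threshold = threshold

    @torch.no_grad()
    def recognize(self, image):
        features = self.feature_extractor(image.unsqueeze(0))      # person feature layer
        probs = self.hypercolumn_model(features).softmax(dim=1)    # image recognition result
        confidence, label = probs.max(dim=1)                       # recognition accuracy proxy
        if confidence.item() >= self.threshold:
            return label.item()                                    # accepted fine-grained result
        return None                                                # below the preset threshold
```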
Other embodiments or specific implementations of the image recognition device based on fine-grained features of a person may refer to the above method embodiments, and are not described herein again.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not denote any order; these words are to be interpreted as names.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be substantially implemented or a part contributing to the prior art may be embodied in the form of a software product, where the computer software product is stored in a storage medium (e.g., a Read Only Memory (ROM)/Random Access Memory (RAM), a magnetic disk, an optical disk), and includes several instructions for enabling a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.
Claims (10)
1. An image identification method based on fine-grained characteristics of people is characterized by comprising the following steps:
acquiring a figure image to be identified;
extracting the characteristics of the figure image to be identified to obtain a figure characteristic image layer;
inputting the figure feature layer into a preset supercolumn feature recognition model to obtain a corresponding image recognition result;
acquiring image identification accuracy according to the image identification result;
and when the image identification accuracy is greater than or equal to a preset standard threshold, taking the image identification result as an image identification result based on the fine-grained features of the person.
2. The method of claim 1, wherein the step of obtaining the image of the person to be identified is preceded by the step of:
acquiring image training sets corresponding to different characters, and traversing the image training sets to obtain traversed current training images;
Obtaining a corresponding sample convolution layer according to the current training image;
extracting a sample characteristic layer from the sample convolution layer;
obtaining a layer pixel point set corresponding to the sample characteristic layer;
superposing the layer pixel point set by a preset up-sampling method to obtain a sample super-column set;
when the traversal is finished, constructing a sample super-column set according to all the obtained sample super-column sets;
respectively preprocessing each sample super-column set in the sample super-column set to obtain a sample target image set;
obtaining a sample figure identification result corresponding to each sample target image contained in the sample target image set;
and constructing a preset supercolumn feature recognition model according to the training image set and the sample character recognition result.
3. The method of claim 2, wherein the step of separately pre-processing each superset of the set of supersets of samples to obtain a set of sample target images comprises:
traversing the sample super-column set to obtain a traversed current sample super-column set;
preprocessing the current sample super-column set by a preset down-sampling method to obtain a target super-column set;
Flattening the target super-column set to obtain a target area;
determining attention area positioning parameters according to the target area;
when the traversal is finished, constructing an attention area positioning parameter set according to all acquired attention area positioning parameters;
and respectively processing each sample target image contained in the sample target image set according to each attention area positioning parameter in the attention area positioning parameter set to obtain a sample target image set.
4. The method according to claim 3, wherein the step of processing each sample target image included in the sample target image set according to each attention area localization parameter in the attention area localization parameter set to obtain a sample target image set comprises:
traversing the attention area positioning parameter set to obtain a traversed current attention area positioning parameter;
determining the position of a target area according to the current attention area positioning parameter;
performing area cutting on the current training image according to the position of the target area to obtain a target area image;
amplifying the target area image by a preset bilinear interpolation method to obtain a sample target image;
And at the end of the traversal, constructing a sample target image set according to all the obtained sample target images.
5. The method of claim 2, wherein after the step of obtaining the sample person recognition result corresponding to each sample target image included in the sample target image set, the method further comprises:
inputting the sample feature layer into a preset residual error model to obtain a sample high-dimensional feature layer;
determining a sample class probability loss value according to the sample high-dimensional feature layer;
judging whether the sample class probability loss value is larger than a preset probability threshold value or not;
and when the sample class probability loss value is larger than the preset probability threshold value, executing the step of constructing a preset supercolumn feature recognition model according to the training image set and the sample person recognition result.
6. The method of claim 5, wherein the step of determining whether the sample class probability loss value is greater than a preset probability threshold further comprises:
and returning to the step of extracting the sample feature layer from the sample convolution layer when the sample class probability loss value is less than or equal to the preset probability threshold.
7. The method of claim 1, wherein the step of extracting the features of the image of the person to be recognized to obtain the person feature map layer comprises:
inputting the figure image to be identified into a preset convolutional neural network model to obtain an initial characteristic map layer;
pooling the initial feature map layer to obtain an attention image;
and obtaining a character feature layer according to the attention image and the initial feature layer.
8. An image recognition device based on human fine-grained features is characterized by comprising:
the acquisition module is used for acquiring a figure image to be identified;
the extraction module is used for extracting the characteristics of the figure image to be identified to obtain a figure characteristic map layer;
the recognition module is used for inputting the figure feature layer into a preset supercolumn feature recognition model to obtain a corresponding image recognition result;
the acquisition module is also used for acquiring the image identification accuracy according to the image identification result;
and the judging module is used for taking the image recognition result as an image recognition result based on the fine-grained features of the person when the image recognition accuracy is greater than or equal to a preset standard threshold.
9. An image recognition device based on human fine-grained features, which is characterized by comprising: a memory, a processor and an image recognition program based on human fine-grained features stored on the memory and operable on the processor, the image recognition program based on human fine-grained features implementing the steps of the image recognition method based on human fine-grained features according to any one of claims 1 to 7 when executed by the processor.
10. A storage medium, characterized in that the storage medium stores thereon an image recognition program based on fine-grained features of a person, the image recognition program based on fine-grained features of a person realizing the steps of the image recognition method based on fine-grained features of a person according to any one of claims 1 to 7 when executed by a processor.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010655258.0A CN111860250B (en) | 2020-07-14 | 2020-07-14 | Image recognition method and device based on fine-grained character features |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111860250A true CN111860250A (en) | 2020-10-30 |
CN111860250B CN111860250B (en) | 2024-04-26 |
Family
ID=73152514
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010655258.0A Active CN111860250B (en) | 2020-07-14 | 2020-07-14 | Image recognition method and device based on fine-grained character features |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111860250B (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106101696A (en) * | 2016-06-16 | 2016-11-09 | 北京数智源科技股份有限公司 | Video quality diagnosis system and video quality analysis algorithm |
CN110678901A (en) * | 2017-05-22 | 2020-01-10 | 佳能株式会社 | Information processing apparatus, information processing method, and program |
CN109859209A (en) * | 2019-01-08 | 2019-06-07 | 平安科技(深圳)有限公司 | Remote Sensing Image Segmentation, device and storage medium, server |
CN111368788A (en) * | 2020-03-17 | 2020-07-03 | 北京迈格威科技有限公司 | Training method and device of image recognition model and electronic equipment |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112634231A (en) * | 2020-12-23 | 2021-04-09 | 香港中文大学深圳研究院 | Image classification method and device, terminal equipment and storage medium |
CN113080874A (en) * | 2021-04-17 | 2021-07-09 | 北京美医医学技术研究院有限公司 | Multi-angle cross validation intelligent skin measuring system |
CN113080874B (en) * | 2021-04-17 | 2023-02-07 | 北京美医医学技术研究院有限公司 | Multi-angle cross validation intelligent skin measuring system |
CN115908280A (en) * | 2022-11-03 | 2023-04-04 | 广东科力新材料有限公司 | Data processing-based performance determination method and system for PVC calcium zinc stabilizer |
Also Published As
Publication number | Publication date |
---|---|
CN111860250B (en) | 2024-04-26 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |