CN116912924A

CN116912924A - Target image recognition method and device

Info

Publication number: CN116912924A
Application number: CN202311168788.2A
Authority: CN
Inventors: 蒋召; 黄泽元
Original assignee: Shenzhen Xumi Yuntu Space Technology Co Ltd
Current assignee: Shenzhen Xumi Yuntu Space Technology Co Ltd
Priority date: 2023-09-12
Filing date: 2023-09-12
Publication date: 2023-10-20
Anticipated expiration: 2043-09-12
Also published as: CN116912924B

Abstract

The application relates to the technical field of image recognition, and provides a target image recognition method, a target image recognition device, electronic equipment and a computer readable storage medium. The method comprises the following steps: acquiring a target face image; extracting facial features in the target facial image, wherein the facial features comprise 1 global feature and a plurality of local features; after each group of local features are respectively fused with the global features, attention learning is carried out to obtain corresponding enhancement features; based on the enhanced features, facial expression classification results are obtained. The target image recognition method of the embodiment of the application perceives local features and performs attention learning with global features based on the detection of the key points of the facial image, thereby enhancing the features of the effective area of the face and improving the accuracy of the facial expression recognition algorithm.

Description

Target image recognition method and device

Technical Field

The present application relates to the field of image recognition technologies, and in particular, to a target image recognition method, device, electronic apparatus, and computer readable storage medium.

Background

In recent years, with the development of facial expression recognition algorithms, facial expression recognition in a simple scene can achieve a good effect. In facial expression recognition algorithms, feature integrity is an extremely critical factor in the success and failure of the algorithms, existing algorithms too rely on obvious and complete image features, but the existing algorithms fail when occlusion or excessive facial pose exists in a scene, i.e., facial image features are incomplete. The difficulty in facial expression recognition caused by occlusion is mainly manifested in the aspects of feature loss, alignment error or local aliasing caused by occlusion.

Therefore, how to improve expression recognition accuracy in complex scenes through rich feature information is a technical problem to be solved.

Disclosure of Invention

In view of the above, the embodiments of the present application provide a target image recognition method, apparatus, electronic device, and computer readable storage medium, so as to solve the problem of low facial expression recognition accuracy in the case of feature missing in the prior art.

In a first aspect of an embodiment of the present application, there is provided a target image recognition method, including:

acquiring a target face image;

extracting facial features in the target facial image, the facial features including 1 global feature and a plurality of local features;

after each group of local features are respectively fused with the global features, attention learning is carried out to obtain corresponding enhancement features;

based on the enhanced features, facial expression classification results are obtained.

A second aspect of an embodiment of the present application provides a target image recognition device, which is applicable to the facial expression recognition method described in the first aspect, including:

the image acquisition module can acquire a target face image;

a feature extraction module capable of extracting facial features in the target facial image, the facial features including 1 global feature and a plurality of local features;

the attention learning module is capable of carrying out attention learning after fusing each group of local features with the global features so as to obtain corresponding enhancement features;

and the classification and identification module can obtain facial expression classification results based on the enhanced features.

In a third aspect of embodiments of the present application, there is provided an electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the method of the first aspect when the computer program is executed.

In a fourth aspect of embodiments of the present application, there is provided a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method of the first aspect.

Compared with the prior art, the embodiment of the application has the beneficial effects that: the embodiment of the application acquires a target face image; extracting facial features in the target facial image, wherein the facial features comprise global features and local features; after each group of local features are respectively fused with the global features, attention learning is carried out to obtain corresponding enhancement features; based on the enhanced features, facial expression classification results are obtained. The facial expression recognition method of the embodiment of the application perceives the local features and performs attention learning with the global features based on the detection of the key points of the facial image, thereby enhancing the features of the effective area of the face and improving the accuracy of the facial expression recognition algorithm.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic flow chart of a target image recognition method according to an embodiment of the present application;

FIG. 2 is a second flowchart of a target image recognition method according to an embodiment of the present application;

FIG. 3 is a schematic diagram of feature extraction of a target image recognition method according to an embodiment of the present application;

FIG. 4 is a third flowchart of a target image recognition method according to an embodiment of the present application;

FIG. 5 is a schematic diagram of attention to learning of a target image recognition method according to an embodiment of the present application;

FIG. 6 is a flowchart of a target image recognition method according to an embodiment of the present application;

fig. 7 is a schematic structural diagram of a target image recognition device according to an embodiment of the present application;

fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, techniques, etc., in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.

A facial expression recognition method, apparatus, electronic device, and storage medium according to embodiments of the present application will be described in detail below with reference to the accompanying drawings.

As described in the background, facial expression recognition is a classification task that recognizes a corresponding emotion by modeling facial changes in visual information. By this technique, the machine can be made to understand the emotion change and make appropriate judgment based on the psychological state. However, in common emotion encoding models, including classification and dimension models, annotation of facial expressions is often very noisy.

The facial expression recognition method mainly uses artificially designed low-dimensional features initially, adopts a classifier corresponding to the features to perform expression recognition classification, for example, extracts discrete cosine transform coefficients as facial expression features, and then performs expression recognition by using the features. Along with research and application of the machine learning algorithm in various fields, the machine learning algorithm is widely adopted and has a certain result in facial expression recognition research, for example, a hidden Markov model is used for facial expression recognition research, but the facial expression recognition classification effect of the method depends on the extracted expression feature description operator, and the anti-interference capability is weak. Along with the continuous development of the deep learning convolutional neural network algorithm and the advantages of the algorithm in the mode recognition task, the facial expression recognition accuracy is obviously improved. For example, the CNN network is applied to facial expression recognition, and the accuracy of facial expression recognition is greatly improved by designing network architectures of convolution kernels with different sizes and extracting multi-level facial expression features; in an algorithm using a convolutional neural network, expression classification is performed by using an SVM after expression features are extracted, and an expression recognition result and the like are obtained. Meanwhile, an algorithm pre-processes the input picture in advance, so that the problem of over-fitting caused by small data sample size is solved, and the facial expression recognition accuracy is improved to a great extent. The inventor searches that the disclosed facial expression recognition algorithm also comprises a cross-linked LeNet-5 convolutional neural network method, and the characteristics of a deep convolutional layer and a shallow convolutional layer of the network are fused, so that the expression classification effect is improved; and a self-attention mechanism is added into the long-period memory network, so that the accuracy of facial expression recognition is effectively improved.

The existing facial expression recognition algorithms all have the capability of solving the facial expression recognition problem under complex scenes, but the algorithms often generate more shielding data or large pose data from the data angle through a data enhancement algorithm, and the expression recognition effect of the algorithms under the complex scenes can be improved through training of the data, so that the algorithms depend on the quality and the quantity of the data. From the perspective of a model network used by a facial expression recognition algorithm, the model network can learn and extract more robust facial features, so that the facial expression recognition effect is improved, and the method and the device solve the technical problem.

It should be noted that, the usual facial expression recognition includes three main steps, namely, face detection, feature extraction, and facial expression classification recognition. On the premise of accurately detecting the face of a person, feature extraction is a key step in facial expression recognition, and also is a key for improving facial expression recognition rate by extracting high-level semantic features capable of better expressing facial expression information.

The feature extraction means that the organ features, texture areas and predefined feature points of the face of the person are positioned and extracted. It is the most important step in facial expression recognition algorithms, and the final facial expression recognition accuracy of the model network often depends on the extraction of valid features. The deep learning method effectively solves the problem of facial expression feature extraction, and can learn expression features in deeper layers in some facial images, for example, facial expression feature extraction is performed by using a deep learning CNN algorithm. Expression feature extraction methods based on deep learning include a direct method, a residual method, a mapping method, a compound method and the like. The direct method is to directly extract high-dimensional features from an input image by using a CNN algorithm, and then to perform facial expression recognition by using the high-dimensional features; the residual method is to divide facial composition information into two parts of expression information and non-expression information, respectively extract the information of the expression composition part and the non-expression information, compare the two parts with each other as a difference value, and take the obtained difference value as an expression characteristic to carry out facial expression recognition; the mapping method is that the non-peak expression is harder to detect, so that facial expression recognition is performed by mapping the non-peak expression into the peak expression and then extracting expression characteristics; the combination method is a method for performing expression recognition by combining various features, for example, a method for combining geometrical features of facial appearance features and facial key points, and the like, and belongs to the combination method.

For facial expression recognition and classification, there are mainly two kinds of Support Vector Machine (SVM) method and classification algorithm based on CNN. Particularly, the deep learning method is currently the mainstream research method in the research of expression recognition technology. The deep learning-based classification algorithm is simple and efficient, and combines the training process of the convolutional neural network, feature extraction and expression classification are directly combined together in the CNN, namely, an output layer is added behind the deep extraction feature network layer, so that the number of the final output neurons is the same as the expression category to be classified, and the cross soil moisture of the correct label and the prediction result is calculated by using a Softmax loss function, so that the final classification result is obtained. The deep learning CNN algorithm avoids the complexity of manually extracting the features based on the traditional method, and is very suitable for identifying and classifying in a large-scale data set by using the CNN algorithm.

Furthermore, the inventors have found that there are a variety of implementations of the attention mechanism, particularly in natural language processing and computer vision scenarios, in studying facial expression recognition algorithms, which help specific problems find reasonable solutions. There are currently two main mechanisms of attention: one is to pay attention to the data by enhancing the most significant part to achieve the purpose of enhancing the effect of the effective part of the data and weakening the effect of the harmful part of the data; secondly, attempts are made to exploit the internal links between various aspects of the data to produce a more meaningful representation of the data. Currently, attention mechanisms applicable to the field of computer vision mainly include, global attention, spatial attention, channel attention, self-attention, and independent attention architecture. The facial expression is often expressed by some key areas of the face, how to capture the key areas becomes the key for improving the recognition performance of the algorithm model, the feature extraction capability of the algorithm model can be effectively enhanced by introducing an attention mechanism, so that the attention mechanism focuses on some key areas of the face, and more discriminative feature expression is obtained, but in the existing method, a single attention module is mostly adopted, which may cause focusing on non-key areas of the face, and all subtle and complex appearance changes among different expressions are difficult to capture.

Therefore, in order to solve the problem of expression recognition under occlusion and large pose scenes, the embodiment of the application provides a target image recognition method based on local feature perception. The technical idea of the method is as follows: firstly, carrying out local feature extraction, randomly cutting a facial picture according to key points, then extracting local features of different areas through a convolutional neural network, then learning the attention of the local features of each branch, and then adding the enhanced features, so that the facial features of the effective area of the face are enhanced, and the accuracy of an expression recognition algorithm is improved.

Fig. 1 is a flowchart of a target image recognition method according to the present application. The method comprises the following steps:

s101: a target face image is acquired.

S102: extracting facial features in the target facial image, wherein the facial features comprise 1 global feature and a plurality of local features.

S103: and after each group of local features are respectively fused with the global features, performing attention learning to obtain corresponding enhancement features.

S104: based on the enhanced features, facial expression classification results are obtained.

Specifically, the face detection is performed by acquiring the target face image. In order to detect facial expressions of a person in an image, the facial detection of the person is generally performed on a dataset image, so as to avoid that the characteristics extracted by the network contain too many non-expressive factors. Common face detection algorithms include: adaboost-based face detection, feature-based face detection, and deep learning-based face detection.

For Adaboost face detection, a plurality of weak classifiers with better recognition performance are usually trained first, then the weak classifiers trained on different training sets are combined into one strong classifier, and then a plurality of strong classifiers are combined into a final cascade classifier, so that the face of a person can be detected quickly and accurately.

For feature-based face detection, most use facial parts such as mouth, eyes, etc. as relevant features for detection, for example, the edge and shape features such as face contour, eyelid contour, etc. can be mapped to geometric units for face detection; and the target face image contains specific gray or color distribution texture features as features for face detection.

For deep learning face detection, face detection models including classical convolutional neural network people are cascades CNN, MTCNN, faceless-Net, and the like. For example, cascades CNN is a face detection method of people based on Cascade of multiple classifiers, and filters a large number of non-facial features while guaranteeing recall, and the design of Cascade network architecture enables faster processing speed and higher efficiency; the MTCNN is a network formed by cascading 3 neural networks P-Net, R-Net and O-Net, so that the performance is better, and real-time accurate face detection and face alignment can be performed on an input image.

It should be appreciated that the face detection algorithm described above is only a few specific embodiments. Any algorithm that can implement face detection of a person can be used to detect the face of a person in a target face image to provide a target face image dataset for subsequent feature extraction and facial expression recognition classification. Therefore, any method capable of outputting the face detection result of the person is within the protection scope of the embodiment of the present application.

In some embodiments, the local features include keypoint features. It should be noted that, the feature extraction of the key points is an important technology in the field of computer vision, and it can extract the positions and features of the key points from the image or video, so as to be used in applications such as image recognition, object tracking, and three-dimensional reconstruction. The principles, methods and applications of keypoint feature extraction will be described herein. The principle of keypoint feature extraction is based on local features of the image, i.e. to find some local areas with uniqueness, stability and repeatability in the image, called keypoints. These key points may be corner points, edges, spots, etc. which have the same features in different images and can be used for matching, identification and tracking. The key points are characterized by calculating information such as gradient, direction, scale and the like of the local image, and SIFT, SURF, ORB and the like are common methods. The methods are described and matched based on the local features of the images, and the same key points can be found in different images, so that the matching and the identification of the images are realized.

It should be appreciated that the keypoint feature extraction algorithm described above is merely a few specific embodiments. All the key point extraction algorithms which can extract key features from facial images so as to provide preliminary feature input for subsequent feature enhancement and facial expression recognition classification are within the protection scope of the embodiment of the application.

In some embodiments, the process of extracting facial features in the target facial image described above, as shown in fig. 2, includes:

s211: and extracting a plurality of key points in the target face image.

S212: and randomly cutting the target face image according to each key point to obtain a key point image.

S213: and respectively inputting the target face image and the key point image into a convolutional neural network with the same parameters to obtain 1 corresponding global feature and a plurality of local features.

Specifically, the input image is processed by using the convolutional neural network, a deep feature map can be generated, and then region generation and loss calculation are completed by using various algorithms, and the skeleton of the whole algorithm of the convolutional neural network, namely a Backbone, has the function of extracting information in the image for later network use. These backhaul networks have strong feature extraction capabilities in terms of classification and the like. When the network is used as a backstage, the trained model parameters can be directly loaded and then the model is connected with a self-defined network structure, and the whole model can train the backstage and the self-defined network at the same time, and only the backstage is required to be finely tuned in the training process so that the backstage is more suitable for the self-defined model task. A typical convolutional neural network generally comprises three basic operations, namely convolution, pooling and full-connection, and abstract features of different scales are extracted by continuously performing convolution and pooling operations on images while preserving important facial information.

It should be noted that, the purpose of inputting the local feature and the global feature into the convolutional neural network with the same parameters is to make the effect of the convolutional neural network with the same parameters on feature extraction of the local feature or the global feature be significantly better than that of the fully connected network if the local feature or the global feature is obtained through the backhaul extraction.

In some embodiments, the keypoint image is adjusted to have the same image size as the target face image prior to input to the convolutional neural network.

Specifically, since factors such as image size and resolution have an influence on the computation performance of the convolutional neural network, the size of the key point image may be adjusted to the same image size as the target face image in this embodiment, so that the convolutional neural network exhibits the same computation performance when extracting the global feature and the local feature.

As shown in fig. 3, assume that key points are extracted from an input face original image of a person, then image cropping is performed on the key points, after the image size of the key points is adjusted to be the same as the size of the face original image of the person, the key points are input into a convolutional neural network of a designated back, and key point features and global features are respectively output, wherein the key point features are local features.

In some embodiments, after each set of the local features are respectively fused with the global features, a process of learning attention is performed to obtain corresponding enhancement features, as shown in fig. 4, including:

s411: and respectively carrying out channel splicing on each group of local features and the global features to obtain corresponding channel splicing features.

S412: and inputting the channel splicing characteristics into an attention mechanics learning unit to obtain corresponding attention force diagram characteristics.

S413: multiplying the channel stitching feature by the attention seeking feature to obtain the corresponding enhancement feature.

In some embodiments, the attention learning unit includes a global average pooling layer, a full connection layer, and a Sigmoid layer in that order.

Specifically, as shown in fig. 5, the extracted key point feature and the global feature are exemplified. The first key point feature and the second key point feature are assumed to be local features obtained after feature extraction, namely local feature branches in the graph. First, the first key point feature and the global feature are subjected to channel stitching. The above-mentioned channel splicing means that the feature sizes are the same, the number of channels is extended, the splicing result is that the feature sizes are unchanged, and the number of channels is added. And then, performing attention learning on the channel splicing characteristics obtained after channel splicing. In one implementation manner of the embodiment of the application, the attention mechanics learning unit sequentially comprises an average pooling layer, a full connection layer and a Sigmoid layer, wherein the Sigmoid layer is a neuron nonlinear function of the convolutional neural network. The enhanced attention seeking characteristic is obtained after passing through the attention learning unit. The attention seeking feature is then multiplied by the channel stitching feature, i.e., by means of a dot product attention weight calculation. Similarly, after the second key point feature performs the above steps, the features of the two partial feature branches after the multiplication are added.

In some embodiments, based on the enhanced features, the process of obtaining the facial expression classification result, as shown in fig. 6, includes:

s611: and adding a plurality of groups of enhancement features to obtain a fusion feature.

S612: and inputting the fusion characteristics to a classification recognition unit to obtain a facial expression classification result.

In some embodiments, the classification recognition unit includes a full connection layer and a classification layer.

Specifically, full-connection layer classification is one classification method commonly used in deep learning. Full-connection layer classification may be used in fields such as image classification, natural language processing, speech recognition, and the like. Wherein the fully connected layer is an important layer in the neural network. It is a graph structure consisting of a collection of nodes, each connected to each node in the previous layer. In deep learning, the function of the fully connected layer is to connect the outputs of all nodes of the previous layer to all nodes of the current layer, thereby obtaining the output of the current layer. The principle of the full connection layer is to establish a connection between each node, then perform a weighted sum operation on all input data, and finally output through an activation function. These outputs pass down to the next layer, and the final output layer may output a distributed representation for classifying the input samples. In deep learning, full-connected layers are often used between the last convolutional layer and the classifier. This is why the output of the fully connected layer can be the final classifier output. In addition, each node in the full connection layer can represent different characteristics, and the characteristics are finally fused into the whole after weight adjustment and layer-by-layer transformation, so that the classification is more accurate.

The embodiment of the application acquires a target face image; extracting facial features in the target facial image, wherein the facial features comprise global features and local features; after each group of local features are respectively fused with the global features, attention learning is carried out to obtain corresponding enhancement features; based on the enhanced features, facial expression classification results are obtained. The facial expression recognition method of the embodiment of the application perceives the local features and performs attention learning with the global features based on the detection of the key points of the facial image, thereby enhancing the features of the effective area of the face and improving the accuracy of the facial expression recognition algorithm.

Any combination of the above optional solutions may be adopted to form an optional embodiment of the present application, which is not described herein.

The following are examples of the apparatus of the present application that may be used to perform the method embodiments of the present application. For details not disclosed in the system embodiments of the present application, please refer to the method embodiments of the present application.

Fig. 7 is a schematic diagram of a target image recognition device according to an embodiment of the present application. As shown in fig. 7, the target image recognition apparatus includes:

the picture acquisition module 701 is capable of acquiring a target face image.

The feature extraction module 702 is capable of extracting facial features in the target facial image, where the facial features include 1 global feature and a plurality of local features.

The attention training module 703 is capable of performing attention learning after fusing each set of the local features with the global features, so as to obtain corresponding enhancement features.

The classification recognition module 704 can obtain facial expression classification results based on the enhanced features.

It should be understood that a facial expression recognition device according to the embodiments of the present disclosure may also perform the methods performed by the facial expression recognition device in fig. 1 to 6, and implement the functions of the facial expression recognition device in the examples shown in fig. 1 to 6, which are not described herein again. Meanwhile, the sequence number of each step in the above embodiment does not mean that the execution sequence of each process should be determined by the function and the internal logic, and should not limit the implementation process of the embodiment of the present application.

Fig. 8 is a schematic diagram of an electronic device 8 according to an embodiment of the present application. As shown in fig. 8, the electronic device 8 of this embodiment includes: a processor 801, a memory 802, and a computer program 803 stored in the memory 802 and executable on the processor 801. The steps of the various method embodiments described above are implemented by the processor 801 when executing the computer program 803. Alternatively, the processor 801, when executing the computer program 803, performs the functions of the modules/units of the apparatus embodiments described above.

The electronic device 8 may be a desktop computer, a notebook computer, a palm computer, a cloud server, or the like. The electronic device 8 may include, but is not limited to, a processor 801 and a memory 802. It will be appreciated by those skilled in the art that fig. 8 is merely an example of the electronic device 8 and is not limiting of the electronic device 8 and may include more or fewer components than shown, or different components.

The memory 802 may be an internal storage unit of the electronic device 8, for example, a hard disk or a memory of the electronic device 8. The memory 802 may also be an external storage device of the electronic device 8, for example, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash Card (Flash Card) or the like, which are provided on the electronic device 8. Memory 802 may also include both internal storage units and external storage devices for electronic device 8. The memory 802 is used to store computer programs and other programs and data required by the electronic device.

The processor 801 may be a central processing unit (CentralProcessing Unit, CPU) or other general purpose processor, digital signal processor (Digital SignalProcessor, DSP), application specific integrated circuit (Application Specific Integrated Circuit, ASIC), field-programmable gate array (Field-ProgrammableGate Array, FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like. The processor 801 reads a corresponding computer program from the nonvolatile memory into the memory and then runs the program, and forms a shared resource access control device on a logical level. The processor is used for executing the programs stored in the memory and is specifically used for executing the following operations:

acquiring a target face image;

extracting facial features in the target facial image, wherein the facial features comprise 1 global feature and a plurality of local features;

The facial expression recognition method disclosed in the embodiments shown in fig. 1 to 6 of the present specification may be applied to the processor 801 or implemented by the processor 801. The processor 801 may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or by instructions in the form of software. The above-described processor may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present specification. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the embodiments of the present specification may be embodied directly in hardware, in a decoded processor, or in a combination of hardware and software modules in a decoded processor. The software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, or electrically erasable programmable memory, registers, etc. as well known in the art. The storage medium is located in a memory, and the processor reads the information in the memory and, in combination with its hardware, performs the steps of the above method.

Of course, in addition to the software implementation, the electronic device of the embodiments of the present disclosure does not exclude other implementations, such as a logic device or a combination of software and hardware, that is, the execution subject of the following processing flow is not limited to each logic unit, but may also be hardware or a logic device.

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit.

The integrated modules/units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the present application may implement all or part of the flow of the method of the above embodiment, or may be implemented by a computer program to instruct related hardware, and the computer program may be stored in a computer readable storage medium, where the computer program, when executed by a processor, may implement the steps of each of the method embodiments described above. The computer program may comprise computer program code, which may be in source code form, object code form, executable file or in some intermediate form, etc. The computer readable medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-only memory (ROM), a random access memory (Random Access Memory, RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content of the computer readable medium can be appropriately increased or decreased according to the requirements of the jurisdiction's jurisdiction and the patent practice, for example, in some jurisdictions, the computer readable medium does not include electrical carrier signals and telecommunication signals according to the jurisdiction and the patent practice.

The embodiments of the present specification also propose a computer-readable storage medium storing one or more programs, the one or more programs including instructions, which when executed by a portable electronic device comprising a plurality of application programs, enable the portable electronic device to perform the facial expression recognition method of the embodiments shown in fig. 1 to 6, and in particular to perform the following method:

acquiring a target face image;

In summary, the above description is only a preferred embodiment of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present specification should be included in the protection scope of the present specification.

The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

Computer readable media, including both non-transitory and non-transitory, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static Random Access Memory (SRAM), dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), read Only Memory (ROM), electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. Computer-readable media, as defined herein, does not include transitory computer-readable media (transshipment) such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.

In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, the description is relatively simple, as relevant to see a section of the description of method embodiments.

The above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.

Claims

1. A method for identifying an object image, comprising:

acquiring a target face image;

2. The method of claim 1, wherein the local features comprise keypoint features; and a process of extracting facial features in the target facial image, comprising:

extracting a plurality of key points in the target face image;

randomly cutting the target face image according to each key point to obtain a key point image;

and respectively inputting the target face image and the key point image into a convolutional neural network with the same parameters to obtain 1 corresponding global feature and a plurality of local features.

3. The method of claim 2, wherein the keypoint image is adjusted to have the same image size as the target facial image prior to input to the convolutional neural network.

4. A method according to claim 3, wherein the process of performing attention learning to obtain corresponding enhancement features after fusing each set of local features with the global features, respectively, comprises:

channel stitching is carried out on each group of local features and the global features respectively, so that corresponding channel stitching features are obtained;

inputting the channel splicing characteristics to an attention mechanics learning unit to obtain corresponding attention diagram characteristics;

multiplying the channel stitching feature with the attention seeking feature to obtain the corresponding enhancement feature.

5. The method of claim 4, wherein the attention unit comprises, in order, a global averaging pooling layer, a full connection layer, and a Sigmoid layer.

6. The method of claim 1, wherein the process of obtaining a facial expression classification result based on the enhanced features comprises:

adding a plurality of groups of enhancement features to obtain fusion features;

and inputting the fusion characteristics to a classification and identification unit to obtain a facial expression classification result.

7. The method of claim 1, wherein the classification recognition unit comprises a full connection layer and a classification layer.

8. A target image recognition apparatus, characterized by being adapted to the facial expression recognition method of any one of claims 1 to 7, comprising:

the image acquisition module can acquire a target face image;

9. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, realizes the steps of the method according to any of claims 1 to 7.

10. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any one of claims 1 to 7.