CN114419391A - Target image identification method and device, electronic equipment and readable storage medium - Google Patents

Target image identification method and device, electronic equipment and readable storage medium

Info

Publication number
CN114419391A
CN114419391A (Application CN202111621115.9A)
Authority
CN
China
Prior art keywords
feature
target image
local features
image recognition
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111621115.9A
Other languages
Chinese (zh)
Inventor
魏晓明
王致岭
闵巍庆
康丽萍
魏晓林
蒋树强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Institute of Computing Technology of CAS
Beijing Sankuai Online Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS, Beijing Sankuai Online Technology Co Ltd filed Critical Institute of Computing Technology of CAS
Priority to CN202111621115.9A
Publication of CN114419391A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target image identification method and device, an electronic device and a readable storage medium. The method comprises: inputting a target image into a pre-trained target image recognition model; acquiring, through a feature learning module of the target image recognition model, a plurality of local features of different dimensions corresponding to the target image, and determining a plurality of first vector representations corresponding to the local features; determining, through a feature enhancement module of the target image recognition model, a second vector representation corresponding to the plurality of local features; and classifying the plurality of first vector representations and the second vector representation through a feature classification module of the target image recognition model to obtain a prediction result of the target image. The invention solves the technical problem of low accuracy of food image prediction results in the related art.

Description

Target image identification method and device, electronic equipment and readable storage medium
Technical Field
The invention relates to the technical field of image recognition, and in particular to a target image recognition method and device, an electronic device, and a readable storage medium.
Background
The main task of food image recognition is to identify the type of food in a food image by computer technology, or to identify other semantic information at different granularities. The wide adoption of portable devices (such as mobile phones and cameras) and wearable devices (such as wearable cameras), together with the rapid development of artificial intelligence technology, gives food image recognition broad application prospects. For example, by identifying the category, ingredients or other attribute information of dishes, the nutrient content of the dishes can be analyzed, the eating habits of users can be evaluated, and health supervision and disease prevention for users can be realized. By recognizing customers' meals, fresh fruits and vegetables, packaged food and the like, food image recognition can also realize automatic settlement, with applications in unmanned restaurants, unmanned supermarkets and the food industry. In addition, food image recognition enables food recommendation and the organization and retrieval of food images on social networking sites. For these reasons, food image recognition has gradually become a research hotspot in multiple fields, including computer vision, multimedia, industrial informatics, medical and health informatics, agriculture, and bioengineering.
Unlike the conventional fine-grained image recognition task, food image recognition must consider image features of different granularities at the same time: food images have both distinct global characteristics and distinct local characteristics.
Most existing food image recognition methods directly apply a conventional convolutional neural network to recognize food, without considering the characteristics of food images; that is, global features and local features are not effectively combined. Some food recognition methods follow the traditional fine-grained recognition idea, so the local features they use are limited. In addition, although some methods adopt an attention mechanism for food recognition, they generally use channel or spatial attention, which cannot realize information interaction between local features of different scales.
It can be seen that no effective solution to the above problems has been proposed in the related art.
Disclosure of Invention
The embodiment of the invention provides a target image identification method and device, electronic equipment and a readable storage medium, which at least solve the technical problem of low accuracy of a food image prediction result in the related art.
According to an aspect of an embodiment of the present invention, there is provided a target image recognition method including: inputting a target image into a target image recognition model which is trained in advance; acquiring a plurality of local features of a plurality of different dimensions corresponding to the target image through a feature learning module of the target image recognition model, and determining a plurality of first vector representations corresponding to the local features; determining, by a feature enhancement module of the target image recognition model, a second vector representation corresponding to the plurality of local features; classifying, by a feature classification module of the target image recognition model, the plurality of first vector representations and the second vector representation to obtain a prediction result of the target image.
Further, the feature enhancement module includes an attention module, a feature fusion module, a global enhancement module, and a second classifier, and the determining, by the feature enhancement module of the target image recognition model, a second vector representation corresponding to the plurality of local features includes: processing, by the attention module, the plurality of local features to obtain a plurality of enhanced local features of the same dimension; performing, by the global enhancement module, feature enhancement on a specified local feature to obtain an enhanced global feature, where the specified local feature is a local feature of a highest dimension among the plurality of local features; performing feature fusion on the plurality of enhanced local features and the global feature through the feature fusion module to obtain a joint feature; classifying, by the second classifier, the joint feature to obtain the second vector representation.
Further, the feature learning module includes a plurality of feature extraction units corresponding to the plurality of different dimensions, and a plurality of feature processing units corresponding to the plurality of feature extraction units, wherein determining, by the feature learning module of the target image recognition model, a plurality of first vector representations corresponding to the plurality of local features based on the plurality of local features of the plurality of different dimensions corresponding to the target image includes: performing, by the plurality of feature extraction units, feature extraction on the target image to obtain the plurality of local features; determining, by the plurality of feature processing units, the plurality of first vector representations based on the plurality of local features.
Further, the feature processing unit includes a feature processing subunit and a first classifier, and the feature processing subunit is connected to the first classifier, where before inputting the target image into the pre-trained target image recognition model, the method further includes: sequentially training the plurality of feature processing subunits according to a preset sequence through a preset training sample set until the plurality of feature processing subunits converge, and determining first parameters of the plurality of feature processing subunits; initializing the target image recognition model according to the first parameter, and training the target image recognition model according to the preset training sample set until the target image recognition model is converged.
Further, the feature learning module includes K feature extraction units and K feature processing subunits corresponding to K dimensions, where K is a positive integer, and training the feature processing subunits in sequence according to a preset order through a preset training sample set until the feature processing subunits converge, and determining first parameters of the feature processing subunits, includes: training the plurality of feature processing subunits in sequence, in order of feature dimension from low to high, wherein in step S the first K-S+1 feature processing subunits of the feature learning module are trained, and S is a positive integer smaller than K.
Further, training the plurality of feature processing subunits in sequence according to the order of feature dimensions from low to high includes: determining category probability distributions corresponding to the feature processing subunits according to a plurality of sample local features corresponding to training samples in a preset training sample set; determining a cross entropy loss function corresponding to each feature processing subunit according to the category probability distributions; determining a relative entropy loss function according to the two sample local features corresponding to two adjacent feature processing subunits; and training the plurality of feature processing subunits based on the cross entropy loss function and the relative entropy loss function.
According to another aspect of the embodiments of the present invention, there is also provided a target image recognition apparatus including: the input unit is used for inputting the target image to a pre-trained target image recognition model; a first determining unit, configured to determine, by a feature learning module of the target image recognition model, a plurality of first vector representations corresponding to a plurality of local features of different dimensions corresponding to the target image based on the plurality of local features; a second determining unit, configured to determine, by a feature enhancement module of the target image recognition model, a second vector representation corresponding to the plurality of local features; and the classification unit is used for classifying the plurality of first vector representations and the second vector representation through a feature classification module of the target image recognition model so as to obtain a prediction result of the target image.
Further, the feature enhancement module includes an attention module, a feature fusion module, a global enhancement module, and a second classifier, and the second determination unit includes: a first processing subunit, configured to process, by the attention module, the plurality of local features to obtain a plurality of enhanced local features of the same dimension; the second processing subunit is configured to perform, by the global enhancement module, feature enhancement on a specified local feature to obtain an enhanced global feature, where the specified local feature is a local feature with a highest dimension in the multiple local features; a third processing subunit, configured to perform, by the feature fusion module, feature fusion on the multiple enhanced local features and the global feature to obtain a joint feature; a classification subunit, configured to classify, by the second classifier, the joint feature to obtain the second vector representation.
According to another aspect of the embodiments of the present invention, there is also provided an electronic device, including a processor, a memory, and a program or instructions stored on the memory and executable on the processor, the program or instructions implementing the steps of the target image recognition method as described above when executed by the processor.
According to another aspect of the embodiments of the present invention, there is also provided a readable storage medium on which a program or instructions are stored, the program or instructions, when executed by a processor, implementing the steps of the target image recognition method as described above.
In the embodiment of the invention, a target image is input into a pre-trained target image recognition model; a plurality of local features of different dimensions corresponding to the target image are acquired through a feature learning module of the model, and a plurality of first vector representations corresponding to the local features are determined; a second vector representation corresponding to the plurality of local features is determined through a feature enhancement module of the model; and the plurality of first vector representations and the second vector representation are classified through a feature classification module of the model to obtain a prediction result of the target image. When processing a food image, both the global features and the local features of the food image can be acquired, so that information such as different ingredients and food shapes is effectively captured; feature enhancement is performed on a plurality of local features of different dimensions of the food image, and context information of different dimensions is blended into the local features to enrich the local feature representations. Finally, the global features and the local features are integrated, and the new features obtained after integration further improve food recognition accuracy, thereby solving the technical problem of low accuracy of food image prediction results in the related art.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a schematic flow chart diagram of an alternative target image recognition method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an alternative target image recognition model according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an alternative target image recognition apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
In a practical application scenario, food image recognition can be regarded as a hierarchical task: food images in different categories have obvious, separable visual differences, so recognition can be performed based on the global features of the food images. Under the same category, however, the differences between food images of different sub-categories are very small, so recognition can also be regarded as a fine-grained task. Based on these two points, the technical solution of this embodiment extracts the global features and the local features of the food image at the same time, and then performs recognition based on both.
According to an embodiment of the present invention, there is provided a target image recognition method, as shown in fig. 1, the method including:
s102, inputting a target image into a target image recognition model which is trained in advance;
the target image in the embodiment of the application may be a food image, and may also be an image of other objects, where the food image may be a food image of any dish type and any material. In the present embodiment, the image format of the target image is not limited at all, and may be set specifically according to practical experience, for example, JPEG, PNG, RAW, or other formats.
In addition, for convenience of image feature extraction, the pixel size of the target image is constrained in this embodiment. Specifically, a preset pixel value may be set according to practical experience: when the pixel size of the target image is greater than the preset value, the target image is cropped; when it is smaller than the preset value, the target image is enlarged by means including, but not limited to, image enhancement, padding, and the like, which is not limited in this embodiment.
Specifically, after the target image is acquired through a preset platform or a preset database, the target image is preprocessed and then input into the pre-trained target image recognition model.
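As a concrete illustration, the crop-or-pad preprocessing described above might look as follows in Python with PyTorch/torchvision (a minimal sketch; the 224x224 preset size, the torchvision transforms and the center-crop policy are illustrative assumptions, since the embodiment leaves the preset pixel value and the enlargement method to practical experience):

```python
import torch
from torchvision import transforms
from PIL import Image

# Hypothetical preset size; the patent leaves the exact pixel value
# to practical experience, so 224x224 is only an illustrative choice.
PRESET_SIZE = 224

def preprocess(image_path: str) -> torch.Tensor:
    """Crop images larger than the preset size, pad smaller ones."""
    image = Image.open(image_path).convert("RGB")
    w, h = image.size
    if min(w, h) >= PRESET_SIZE:
        # Larger than the preset value: center-crop down to it.
        tf = transforms.Compose([
            transforms.CenterCrop(PRESET_SIZE),
            transforms.ToTensor(),
        ])
    else:
        # Smaller than the preset value: pad (one of the filling
        # strategies the embodiment mentions) up to the preset size.
        pad_w = max(PRESET_SIZE - w, 0)
        pad_h = max(PRESET_SIZE - h, 0)
        tf = transforms.Compose([
            transforms.Pad((pad_w // 2, pad_h // 2,
                            pad_w - pad_w // 2, pad_h - pad_h // 2)),
            transforms.CenterCrop(PRESET_SIZE),
            transforms.ToTensor(),
        ])
    return tf(image).unsqueeze(0)  # add batch dimension
```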
S104, acquiring a plurality of local features of a plurality of different dimensions corresponding to the target image through a feature learning module of the target image recognition model, and determining a plurality of first vector representations corresponding to the local features;
Next, the target image recognition model in the present embodiment is described:
in this embodiment, the feature learning module is actually an image prediction sub-network, and can perform prediction according to a plurality of local features of the target image to obtain a prediction result of the target image. Meanwhile, the feature learning module, the feature enhancing module and the feature classifying module form a complete image prediction network.
The feature learning module is used for extracting features of a target image input into the target image recognition model to obtain a plurality of local features of the target image in different dimensions, and conducting preliminary prediction on the local features to obtain first vector representations corresponding to the local features.
Optionally, in this embodiment, the feature learning module includes a plurality of feature extraction units corresponding to a plurality of dimensions, and a plurality of feature processing units corresponding to the plurality of feature extraction units, where the determining, by the feature learning module of the target image recognition model, a plurality of first vector representations corresponding to a plurality of local features based on a plurality of local features of a plurality of dimensions corresponding to the target image includes, but is not limited to: performing feature extraction on the target image through a plurality of feature extraction units to obtain a plurality of local features; determining, by a plurality of feature processing units, a plurality of first vector representations based on the plurality of local features.
In the present embodiment, as shown in fig. 2, the target image recognition model in the present embodiment includes a feature learning module 20, a feature enhancing module 21, and a feature classifying module 22, wherein the feature learning module 20 includes a plurality of feature extracting units 200 and a feature processing unit 202.
The image features extracted by the feature extraction units 200 in the feature learning module 20 are image features of different dimensions. For example, the feature extraction units 200 contain different numbers of convolutional layers, so that a plurality of local features of different dimensions are extracted from the target image by these convolutional layers.
In this embodiment, the local features are image features of different dimensions; the local features are then processed and classified by the feature processing units 202 to obtain a plurality of first vector representations output by the feature processing units 202.
Further optionally, in this embodiment, the feature processing unit includes a feature processing subunit and a first classifier, where the feature processing subunit includes a pooling layer and a fully-connected layer, and the feature extraction unit is connected in series with its corresponding pooling layer, fully-connected layer and first classifier. Respectively inputting the plurality of local features into the plurality of feature processing units to obtain a plurality of first vector representations includes, but is not limited to: processing the local features through the pooling layer, the fully-connected layer and the first classifier to obtain the first vector representations.
As shown in fig. 2, the feature processing unit 202 includes a pooling layer 2021, a fully-connected layer 2022 and a first classifier 2023. A local feature extracted by the feature extraction unit 200 connected to the feature processing unit 202 is input into the corresponding feature processing unit 202, and after the local feature passes through the pooling layer 2021, the fully-connected layer 2022 and the first classifier 2023, a first vector representation is output, wherein the pooling layer is a global maximum pooling layer.
In this embodiment, a plurality of local features are obtained by a plurality of feature extraction units and are respectively input to a plurality of feature processing units to determine classification results of target images under different information granularities.
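A minimal PyTorch sketch of such a feature learning module may help fix ideas; the ResNet-50 backbone standing in for the feature extraction units, the embedding size, and the exact head layout are assumptions, not specified by the patent:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class FeatureLearningModule(nn.Module):
    """Multi-stage backbone with one processing head per stage.

    Each stage plays the role of a feature extraction unit; each head
    (global max pooling -> FC -> classifier) plays the role of a
    feature processing unit and emits one "first vector representation".
    """

    def __init__(self, num_classes: int, embed_dim: int = 512):
        super().__init__()
        backbone = resnet50(weights=None)
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool)
        # Four stages with increasing feature dimension (256..2048).
        self.stages = nn.ModuleList(
            [backbone.layer1, backbone.layer2, backbone.layer3, backbone.layer4])
        stage_dims = [256, 512, 1024, 2048]
        self.pools = nn.ModuleList(
            [nn.AdaptiveMaxPool2d(1) for _ in stage_dims])  # global max pooling
        self.fcs = nn.ModuleList(
            [nn.Linear(d, embed_dim) for d in stage_dims])
        self.classifiers = nn.ModuleList(
            [nn.Linear(embed_dim, num_classes) for _ in stage_dims])

    def forward(self, x: torch.Tensor):
        x = self.stem(x)
        local_features, first_vectors = [], []
        for stage, pool, fc, cls in zip(self.stages, self.pools,
                                        self.fcs, self.classifiers):
            x = stage(x)
            local_features.append(x)            # local feature of this dimension
            v = fc(pool(x).flatten(1))          # pooled + fully-connected
            first_vectors.append(cls(v))        # first vector representation
        return local_features, first_vectors
```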
Optionally, in this embodiment, processing the local features through the pooling layer, the fully-connected layer and the classifier to obtain the first vector representations further includes, but is not limited to: for the two fully-connected layers in feature processing units of adjacent dimensions, performing KL-divergence feature enhancement on the image features input into the two fully-connected layers to obtain the first vector representations.
In this embodiment, in order to constrain the feature distribution of the local features of the feature extraction units in different layers and avoid the homogenization or similarity of different local features, KL divergence is introduced between the fully-connected layers in adjacent feature processing units, and KL divergence feature enhancement is performed on the image features input into the two fully-connected layers.
For the local features output by the feature extraction units of different layers, the method further adopts KL divergence to enhance the difference between the local features, calculated as follows:

$$L_{KL} = \sum_{i=1}^{N}\sum_{\substack{j=1 \\ j \neq i}}^{N} D_{KL}\left(y_i \,\middle\|\, y_j\right)$$

where $y_i$ and $y_j$ are the output distributions of different stages; in the present embodiment only the pairwise values between the last three layers of feature extraction units are calculated, so $N = 3$.
S106, determining second vector representations corresponding to the local features through a feature enhancement module of the target image recognition model;
in a specific application scenario, unlike a general fine-grained task, a target image does not have fixed semantic information. The local discriminant features are characterized by diversity and multiscale. Most of the existing food image recognition methods mostly independently mine the discriminant features without considering the relationship between the discriminant features. Therefore, in the embodiment, a self-attention mechanism is adopted to mine the relationship between the local features, and further capture the common features of the food images in the same category.
Optionally, in this embodiment, the feature enhancement module includes an attention module, a feature fusion module, a global enhancement module and a second classifier, and determining the second vector representation corresponding to the plurality of local features through the feature enhancement module of the target image recognition model includes, but is not limited to: processing the plurality of local features through the attention module to obtain a plurality of enhanced local features of the same dimension; performing feature enhancement on a specified local feature through the global enhancement module to obtain an enhanced global feature, where the specified local feature is the local feature of the highest dimension among the plurality of local features; performing feature fusion on the plurality of enhanced local features and the global feature through the feature fusion module to obtain a joint feature; and classifying the joint feature through the second classifier to obtain the second vector representation.
In this embodiment, the specified local feature output by the feature extraction unit of the highest dimension in the feature learning module is input to the feature enhancement module, and the image feature output by its global enhancement module is taken as the global target image feature, that is, the global feature.
As shown in fig. 2, the feature enhancement module 21 includes, but is not limited to, an attention module 210, a feature fusion module 212, a global enhancement module 214 and a second classifier 216. Still taking the target image recognition model shown in fig. 2 as an example, the global enhancement module 214 includes, but is not limited to, a global average pooling layer 2140 and a fully-connected layer 2142; the local feature of the highest dimension is input to the feature enhancement module 21, and the global average pooling layer 2140 and the fully-connected layer 2142 are used to obtain the global image feature, that is, the global feature.
In this embodiment, a plurality of the local features output from the feature learning module are selected; the number of selected local features may be set according to actual needs, and the selected local features are the higher-dimension ones among the local features of different dimensions. The selected local features are input into the attention module to obtain a plurality of enhanced local features.
Specifically, as shown in fig. 2, three sets of local features are selected in order starting from the highest dimension and input to the attention module 210. Context information of different scales is extracted through the self-attention mechanism and used to enhance the three sets of local features; the enhanced features are then unified into the same dimension through different convolutional layers and merged (concat). Finally, three enhanced local features output by the attention module are obtained, realizing cross-size interaction among the local features.
Then, the feature fusion module 212 performs feature fusion on the plurality of enhanced local features output by the attention module 210 and the global feature, fusing the enhanced local features and the global feature of the target image to obtain a joint feature. The joint feature is then input into the second classifier 216 and classified to determine the classification result corresponding to the fused global-local features, yielding the second vector representation.
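A sketch of the global-enhancement and fusion path described above, again in PyTorch; the layer sizes are illustrative assumptions, and the second classifier follows the two-FC/two-BN/one-ELU composition given later in the training section (the attention module itself is sketched separately after the attention equations below):

```python
import torch
import torch.nn as nn

class GlobalEnhancementModule(nn.Module):
    """Global average pooling followed by a fully-connected layer,
    applied to the highest-dimension local feature (cf. 2140/2142)."""

    def __init__(self, in_dim: int = 2048, out_dim: int = 512):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(in_dim, out_dim)

    def forward(self, highest_local: torch.Tensor) -> torch.Tensor:
        return self.fc(self.gap(highest_local).flatten(1))  # global feature

class FeatureEnhancementHead(nn.Module):
    """Fuses enhanced local features with the global feature and
    classifies the joint feature with the second classifier."""

    def __init__(self, local_dim: int, num_locals: int,
                 global_dim: int, num_classes: int):
        super().__init__()
        joint_dim = local_dim * num_locals + global_dim
        # Second classifier: two FC layers, two BN layers, one ELU,
        # matching the composition stated in the training section.
        self.second_classifier = nn.Sequential(
            nn.Linear(joint_dim, 1024), nn.BatchNorm1d(1024), nn.ELU(),
            nn.Linear(1024, num_classes), nn.BatchNorm1d(num_classes))

    def forward(self, enhanced_locals: list, global_feature: torch.Tensor):
        # Each enhanced local feature is assumed already pooled/flattened
        # to (B, local_dim) before fusion.
        joint = torch.cat(enhanced_locals + [global_feature], dim=1)
        return self.second_classifier(joint)   # second vector representation
```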
And S108, classifying the plurality of first vector representations and the second vector representation through a feature classification module of the target image recognition model to obtain a prediction result of the target image.
After the plurality of first vector representations output by the feature learning module and the second vector representation output by the feature enhancement module are obtained, the target image is predicted according to the plurality of first vector representations and the second vector representation to obtain the prediction result of the target image.
Optionally, in this embodiment, the feature processing unit includes a feature processing subunit and a first classifier, and the feature processing subunit is connected to the first classifier, where before inputting the target image into the pre-trained target image recognition model, the method further includes, but is not limited to: training the plurality of feature processing subunits in sequence according to a preset sequence through a preset training sample set until the plurality of feature processing subunits converge, and determining first parameters of the plurality of feature processing subunits; initializing a target image recognition model according to the first parameter, and training the target image recognition model according to a preset training sample set until the target image recognition model is converged.
In this embodiment, each training sample in the preset training sample set is a two-tuple, i.e., <food image, label>, where the food image is the input training sample and the label is the training target. The training sample set is acquired through a preset network platform or database, and the target image model is then trained on it.
Specifically, in the present embodiment, in the training process of the target image recognition model, a plurality of feature processing subunits in the feature learning module in the target image recognition model are trained through a preset training sample set until the feature learning module converges. After the training of the feature learning module is completed, parameters of the feature processing subunits are determined, and then the target image recognition model is initialized based on the parameters of the feature processing subunits, so that the whole model is trained.
Optionally, in this embodiment, the feature learning module includes K feature extraction units and K feature processing subunits corresponding to K dimensions, where K is a positive integer, and sequentially training the plurality of feature processing subunits in a preset order through a preset training sample set until they converge, and determining their first parameters, includes but is not limited to: training the plurality of feature processing subunits in sequence, in order of feature dimension from low to high, wherein in step S the first K-S+1 feature processing subunits of the feature learning module are trained, and S is a positive integer smaller than K.
Further optionally, in this embodiment, training the plurality of feature processing subunits in sequence according to the order of feature dimensions from low to high includes, but is not limited to: determining category probability distributions corresponding to the feature processing subunits according to a plurality of sample local features corresponding to training samples in a preset training sample set; determining a cross entropy loss function corresponding to each feature processing subunit according to the category probability distributions; determining a relative entropy loss function according to the two sample local features corresponding to two adjacent feature processing subunits; and training the plurality of feature processing subunits based on the cross entropy loss function and the relative entropy loss function.
Specifically, in the present embodiment, the convolutional layers of the lower stages are trained first, and convolutional layers of new, higher stages are then gradually introduced into the training process. Because the receptive field of the lower-stage convolutional layers is smaller, this training strategy enables the image recognition model to mine local features with higher discriminability, such as the ingredient features of food. As the number of convolutional-layer stages increases, the network can then capture global food image features, such as shape information. Therefore, the incremental training method can extract more discriminative global-local target image features.
Specifically, the training process of the target image recognition model is divided into S steps, and each step trains the first K-S+1 stages of the local feature extraction network, where K is the number of stages of the convolutional-layer network. In particular, let $x_i$ denote the output of the last convolutional layer of the network to be trained in each step. A convolution block $F_i$, composed of a convolutional layer, a BN (Batch Normalization) layer and a ReLU (Rectified Linear Unit), is defined to process this output, so that a local feature representation

$$f_i = F_i(x_i)$$

can be obtained. For each local feature a classifier $C_i$ is defined, comprising two fully-connected layers and one BN layer; based on this, the class probability distribution corresponding to each stage after progressive training can be obtained:

$$y_i = \mathrm{softmax}\big(C_i(f_i)\big)$$
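A compact sketch of this progressive schedule, reusing the FeatureLearningModule sketched earlier. Note that the patent's two statements (train low stages first, versus train the first K-S+1 subunits in step S) pull in different directions; the prefix rule below follows the K-S+1 formula literally and should be treated as one possible reading, with the optimizer setup as a further assumption:

```python
import torch

def progressive_train(model, train_loader, total_steps, epochs_per_step,
                      loss_fn, lr=1e-3):
    """Step s trains the first K - s + 1 per-stage heads (literal reading
    of the K-S+1 rule); the stem is trained throughout for simplicity."""
    K = len(model.stages)
    for s in range(1, total_steps + 1):
        n_active = K - s + 1                    # number of active subunits
        params = list(model.stem.parameters())
        for i in range(n_active):
            params += list(model.stages[i].parameters())
            params += list(model.fcs[i].parameters())
            params += list(model.classifiers[i].parameters())
        optimizer = torch.optim.SGD(params, lr=lr, momentum=0.9)
        for _ in range(epochs_per_step):
            for images, labels in train_loader:
                _, first_vectors = model(images)
                # Sum the classification losses of the active heads only.
                loss = sum(loss_fn(first_vectors[i], labels)
                           for i in range(n_active))
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
```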
In this embodiment, for the output of each stage, a cross entropy loss function is used to calculate the corresponding loss, so the classification loss of each stage can be defined as:

$$L_{CE}^{(i)} = -\sum_{x_k \in M} \hat{y}_k \log y_i(x_k)$$

where $M$ is the set of training samples, $x_k$ denotes the k-th training sample, and $\hat{y}_k$ is the corresponding category label.
For the local features output by the different stages, a relative entropy (KL divergence) loss function is used in this embodiment to enhance the difference between the local features input to the fully-connected layers, calculated as follows:

$$L_{KL} = \sum_{i=1}^{N}\sum_{\substack{j=1 \\ j \neq i}}^{N} D_{KL}\left(y_i \,\middle\|\, y_j\right)$$

where $y_i$ and $y_j$ are the output distributions of different stages; in this embodiment all pairwise values between the last three layers are calculated.
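The two training losses, sketched in PyTorch. The pairwise KL term over the last three stage outputs follows the formula above; the sign convention (whether the divergence is minimized or maximized to "enhance the difference") is not spelled out in the text and is left as an assumption here:

```python
import torch
import torch.nn.functional as F

def stage_ce_loss(first_vectors, labels):
    """Cross-entropy classification loss, summed over stage outputs."""
    return sum(F.cross_entropy(y, labels) for y in first_vectors)

def pairwise_kl_loss(first_vectors, last_n: int = 3):
    """Pairwise KL divergence between the output distributions of the
    last `last_n` stages (N = 3 in the embodiment). The patent uses
    this to push stage outputs apart; whether it enters the total
    loss with a positive or negative sign is an open reading."""
    outs = [F.log_softmax(y, dim=1) for y in first_vectors[-last_n:]]
    loss = 0.0
    for i in range(len(outs)):
        for j in range(len(outs)):
            if i != j:
                # KL(p_i || p_j) with p_i as the target distribution.
                loss = loss + F.kl_div(outs[j], outs[i],
                                       log_target=True, reduction="batchmean")
    return loss
```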
On the other hand, consider the feature enhancement module in the target image recognition model. Unlike a common fine-grained task, the target image does not have fixed semantic information, and the local discriminative features in the target image exhibit diversity and multi-scale characteristics. This embodiment therefore employs a self-attention mechanism to mine the relationships between local features, based on which co-occurrence features of target images under the same category are further captured. Specifically, the local feature representations $f_i$ of the last N hierarchical stages of the local feature learning network are first extracted, and a feature pyramid is constructed in which the fine (coarse) granularity feature maps lie in the low (high) dimensions. The attention module is then defined as follows:

$$q_i = \mathrm{Conv}_i(f_i), \qquad k_j,\, v_j = \mathrm{Conv}_j(f_j)$$

$$A_{i,j} = \mathrm{softmax}\!\left(\frac{q_i k_j^{\top}}{\sqrt{d}}\right), \qquad \tilde{f}_i = \sum_{j} A_{i,j}\, v_j$$

where $f_i$ and $f_j$ represent local features input at different positions, so that the enhanced local features $\tilde{f}_i$ can be obtained.
Finally, the enhanced local features are unified into the same dimension using N convolutional layers, and the N enhanced local features of the same size are merged (concat), yielding the N attention-enhanced local features $L_i$. Based on this strategy, the target image recognition model can further enable cross-size interaction between features.
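A runnable sketch of that cross-scale attention, implementing the q/k/v equations above; the 1x1 convolutions for the projections, the flattened spatial positions used as tokens, and the common dimension d are illustrative assumptions:

```python
import math
import torch
import torch.nn as nn

class CrossScaleAttention(nn.Module):
    """Self-attention across a feature pyramid: queries come from one
    scale, keys/values from every scale, per the q/k/v equations above."""

    def __init__(self, in_dims: list, d: int = 256):
        super().__init__()
        self.d = d
        self.q_convs = nn.ModuleList([nn.Conv2d(c, d, 1) for c in in_dims])
        self.kv_convs = nn.ModuleList([nn.Conv2d(c, 2 * d, 1) for c in in_dims])

    def forward(self, feats: list) -> list:
        enhanced = []
        for i, f_i in enumerate(feats):
            b, _, h, w = f_i.shape
            q = self.q_convs[i](f_i).flatten(2).transpose(1, 2)  # (B, HW, d)
            out = 0.0
            for j, f_j in enumerate(feats):
                kv = self.kv_convs[j](f_j).flatten(2)            # (B, 2d, H'W')
                k, v = kv.split(self.d, dim=1)
                attn = torch.softmax(
                    q @ k / math.sqrt(self.d), dim=-1)           # (B, HW, H'W')
                out = out + attn @ v.transpose(1, 2)             # (B, HW, d)
            enhanced.append(out.transpose(1, 2).reshape(b, self.d, h, w))
        return enhanced  # same-dimension enhanced local features
```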
In the training process of the target image recognition model, after the global feature and the attention-enhanced local features are obtained, they are combined for training. The final image feature is then expressed as:

$$f_{concat} = \mathrm{concat}(f_G, L_1, \ldots, L_N), \qquad y_{total} = C_{concat}(f_{concat})$$

where $C_{concat}$ is the second classifier, consisting of two fully-connected layers, two BN layers and one ELU activation unit. For the final fused feature, the cross entropy loss function is:

$$L_{concat} = -\sum_{x_k \in M} \hat{y}_k \log y_{total}(x_k)$$
then, in a feature classification module of the target image recognition model, the outputs of the feature learning module and the feature enhancement module are combined to calculate corresponding output scores so as to predict corresponding categories. The method comprises the following steps:
ycombine=concat(y1,……,yn,ytotal)
in this embodiment, optimization will be performed by a combined loss function, including a cross-entropy loss function and a relative entropy (KL divergence) loss function, and the total loss function is as follows:
L=αLconcat+βLkL
wherein, α and β are balance parameters, and can be set according to actual requirements.
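Putting the classification module together, a short sketch of the combined output scores and the total loss; the `pairwise_kl_loss` helper is the one sketched above, and the default α/β values are placeholders since the patent leaves them to actual requirements:

```python
import torch
import torch.nn.functional as F

def total_loss_and_scores(first_vectors, y_total, labels,
                          alpha: float = 1.0, beta: float = 0.1):
    """L = alpha * L_concat + beta * L_KL, plus the combined output
    scores y_combine. The alpha/beta defaults are placeholder values."""
    y_combine = torch.cat(first_vectors + [y_total], dim=1)  # output scores
    l_concat = F.cross_entropy(y_total, labels)              # fused-feature CE
    l_kl = pairwise_kl_loss(first_vectors)                   # from the sketch above
    return alpha * l_concat + beta * l_kl, y_combine
```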
According to the embodiment, a target image is input into a pre-trained target image recognition model; a plurality of local features of different dimensions corresponding to the target image are acquired through a feature learning module of the model, and a plurality of first vector representations corresponding to the local features are determined; a second vector representation corresponding to the plurality of local features is determined through a feature enhancement module of the model; and the plurality of first vector representations and the second vector representation are classified through a feature classification module of the model to obtain a prediction result of the target image. When processing a food image, both the global features and the local features of the food image can be acquired, so that information such as different ingredients and food shapes is effectively captured; feature enhancement is performed on a plurality of local features of different dimensions, and context information of different dimensions is blended into the local features to enrich the local feature representations. Finally, the global features and the local features are integrated, and the new features obtained after integration further improve food recognition accuracy, thereby solving the technical problem of low accuracy of food image prediction results in the related art.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
Example 2
According to an embodiment of the present invention, there is also provided an object image recognition apparatus for implementing the object image recognition method, as shown in fig. 3, the apparatus including:
1) an input unit 30 for inputting a target image to a target image recognition model trained in advance;
2) a first determining unit 32, configured to determine, by a feature learning module of the target image recognition model, a plurality of first vector representations corresponding to a plurality of local features of different dimensions corresponding to the target image based on the plurality of local features;
3) a second determining unit 34, configured to determine, by a feature enhancement module of the target image recognition model, a second vector representation corresponding to the plurality of local features;
4) a classification unit 36, configured to classify, by a feature classification module of the target image recognition model, the plurality of first vector representations and the second vector representation to obtain a prediction result of the target image.
Optionally, in this embodiment, the feature enhancing module includes an attention module, a feature fusion module, a global enhancing module, and a second classifier, and the second determining unit 34 includes but is not limited to:
1) a first processing subunit, configured to process, by the attention module, the plurality of local features to obtain a plurality of enhanced local features of the same dimension;
2) the second processing subunit is configured to perform, by the global enhancement module, feature enhancement on a specified local feature to obtain an enhanced global feature, where the specified local feature is a local feature with a highest dimension in the multiple local features;
3) a third processing subunit, configured to perform, by the feature fusion module, feature fusion on the multiple enhanced local features and the global feature to obtain a joint feature;
4) a classification subunit, configured to classify, by the second classifier, the joint feature to obtain the second vector representation.
Optionally, the specific example in this embodiment may refer to the example described in embodiment 1 above, and this embodiment is not described again here.
Example 3
There is also provided, according to an embodiment of the present invention, an electronic device, including a processor, a memory, and a program or instructions stored on the memory and executable on the processor, the program or instructions, when executed by the processor, implementing the steps of the target image recognition method as described above.
Optionally, in this embodiment, the memory is configured to store program code for performing the following steps:
s1, inputting the target image into the pre-trained target image recognition model;
s2, acquiring a plurality of local features of a plurality of different dimensions corresponding to the target image through a feature learning module of the target image recognition model, and determining a plurality of first vector representations corresponding to the local features;
s3, determining a second vector representation corresponding to the local features through a feature enhancement module of the target image recognition model;
s4, classifying the first vector representations and the second vector representations through a feature classification module of the target image recognition model to obtain a prediction result of the target image.
Optionally, the specific example in this embodiment may refer to the example described in embodiment 1 above, and this embodiment is not described again here.
Example 4
Embodiments of the present invention also provide a readable storage medium on which a program or instructions are stored, which when executed by a processor implement the steps of the target image recognition method as described above.
Optionally, in this embodiment, the readable storage medium is configured to store program code for performing the following steps:
s1, inputting the target image into the pre-trained target image recognition model;
s2, acquiring a plurality of local features of a plurality of different dimensions corresponding to the target image through a feature learning module of the target image recognition model, and determining a plurality of first vector representations corresponding to the local features;
s3, determining a second vector representation corresponding to the local features through a feature enhancement module of the target image recognition model;
s4, classifying the first vector representations and the second vector representations through a feature classification module of the target image recognition model to obtain a prediction result of the target image.
Optionally, the storage medium is further configured to store program codes for executing the steps included in the method in embodiment 1, which is not described in detail in this embodiment.
Optionally, in this embodiment, the storage medium may include, but is not limited to: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
Optionally, the specific example in this embodiment may refer to the example described in embodiment 1 above, and this embodiment is not described again here.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
The integrated unit in the above embodiments, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in the above computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing one or more computer devices (which may be personal computers, servers, network devices, etc.) to execute all or part of the steps of the method according to the embodiments of the present invention.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations should also be regarded as the protection scope of the present invention.

Claims (10)

1. A method for identifying a target image, comprising:
inputting a target image into a target image recognition model which is trained in advance;
acquiring a plurality of local features of a plurality of different dimensions corresponding to the target image through a feature learning module of the target image recognition model, and determining a plurality of first vector representations corresponding to the local features;
determining, by a feature enhancement module of the target image recognition model, a second vector representation corresponding to the plurality of local features;
classifying, by a feature classification module of the target image recognition model, the plurality of first vector representations and the second vector representation to obtain a prediction result of the target image.
2. The method of claim 1, wherein the feature enhancement module comprises an attention module, a feature fusion module, a global enhancement module, and a second classifier,
determining, by a feature enhancement module of the target image recognition model, a second vector representation corresponding to the plurality of local features, including:
processing, by the attention module, the plurality of local features to obtain a plurality of enhanced local features of the same dimension;
performing, by the global enhancement module, feature enhancement on a specified local feature to obtain an enhanced global feature, where the specified local feature is a local feature of a highest dimension among the plurality of local features;
performing feature fusion on the plurality of enhanced local features and the global feature through the feature fusion module to obtain a joint feature;
classifying, by the second classifier, the joint feature to obtain the second vector representation.
3. The method of claim 1, wherein the feature learning module comprises a plurality of feature extraction units corresponding to the plurality of different dimensions and a plurality of feature processing units corresponding to the plurality of feature extraction units, wherein,
determining, by a feature learning module of the target image recognition model, a plurality of first vector representations corresponding to a plurality of local features based on the plurality of local features of the plurality of different dimensions corresponding to the target image, including:
performing, by the plurality of feature extraction units, feature extraction on the target image to obtain the plurality of local features;
determining, by the plurality of feature processing units, the plurality of first vector representations based on the plurality of local features.
4. The method of claim 3, wherein the feature processing unit comprises a feature processing subunit and a first classifier, the feature processing subunit being connected to the first classifier, wherein,
before inputting the target image into the pre-trained target image recognition model, the method further comprises:
sequentially training the plurality of feature processing subunits according to a preset sequence through a preset training sample set until the plurality of feature processing subunits converge, and determining first parameters of the plurality of feature processing subunits;
initializing the target image recognition model according to the first parameter, and training the target image recognition model according to the preset training sample set until the target image recognition model is converged.
5. The method according to claim 3, wherein the feature learning module comprises K feature extraction units and K feature processing subunits corresponding to K dimensions, K being a positive integer, wherein,
training the plurality of feature processing subunits sequentially, in a preset order, on the preset training sample set until the plurality of feature processing subunits converge, and determining the first parameters of the plurality of feature processing subunits, including:
training the plurality of feature processing subunits sequentially, in order of feature dimension from low to high,
wherein the first K-S+1 feature processing subunits of the feature learning module are trained in step S, and S is a positive integer smaller than K.
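Read literally, claim 5's schedule selects the first K-S+1 feature processing subunits for training in step S. A tiny helper capturing that reading; units is assumed to be the list of subunits ordered from low to high feature dimension, and the per-step fitting routine around it (e.g. the phase 1 loop above restricted to a subset) is also assumed:

```python
def subunits_for_step(units, step):
    """Subunits trained in step S of the claimed schedule."""
    k = len(units)
    assert 1 <= step < k, "S must be a positive integer smaller than K"
    return list(units)[: k - step + 1]  # the first K - S + 1 subunits

# e.g.: for s in range(1, len(units)): fit(subunits_for_step(units, s), loader)
```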
6. The method of claim 5, wherein training the plurality of feature processing subunits sequentially according to a low-to-high order of feature dimensions comprises:
determining category probability distribution corresponding to the feature processing subunits according to a plurality of sample local features corresponding to training samples in a preset training sample set;
determining a cross entropy loss function corresponding to each feature processing subunit according to the category probability distribution; and
determining a relative entropy loss function according to two sample local features corresponding to two adjacent feature processing subunits;
training the plurality of feature processing subunits based on the cross entropy loss function and the relative entropy loss function.
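Claim 6's losses can be sketched directly: one cross-entropy term per feature processing subunit, plus a relative entropy (KL divergence) term between the class probability distributions of adjacent subunits. The direction of the KL term and the equal weighting of the two losses are assumptions of this sketch.

```python
import torch.nn.functional as F

def subunit_losses(first_vectors, labels):
    """Cross entropy per subunit plus relative entropy between adjacent
    subunits' class probability distributions."""
    ce = sum(F.cross_entropy(v, labels) for v in first_vectors)
    kl = 0.0
    for prev, curr in zip(first_vectors, first_vectors[1:]):
        # F.kl_div(input, target) expects log-probabilities as input and
        # probabilities as target, and computes KL(target || input).
        kl = kl + F.kl_div(F.log_softmax(curr, dim=-1),
                           F.softmax(prev, dim=-1),
                           reduction="batchmean")
    return ce + kl  # equal weighting of the two terms is an assumption
```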
7. A target image recognition apparatus, comprising:
an input unit, configured to input the target image into a pre-trained target image recognition model;
a first determining unit, configured to determine, by a feature learning module of the target image recognition model, based on a plurality of local features of a plurality of different dimensions corresponding to the target image, a plurality of first vector representations corresponding to the plurality of local features;
a second determining unit, configured to determine, by a feature enhancement module of the target image recognition model, a second vector representation corresponding to the plurality of local features;
and a classification unit, configured to classify, by a feature classification module of the target image recognition model, the plurality of first vector representations and the second vector representation to obtain a prediction result of the target image.
8. The apparatus of claim 7, wherein the feature enhancement module comprises an attention module, a feature fusion module, a global enhancement module, and a second classifier, and wherein the second determination unit comprises:
a first processing subunit, configured to process, by the attention module, the plurality of local features to obtain a plurality of enhanced local features of the same dimension;
a second processing subunit, configured to perform, by the global enhancement module, feature enhancement on a specified local feature to obtain an enhanced global feature, wherein the specified local feature is the local feature of the highest dimension among the plurality of local features;
a third processing subunit, configured to perform, by the feature fusion module, feature fusion on the plurality of enhanced local features and the enhanced global feature to obtain a joint feature;
a classification subunit, configured to classify, by the second classifier, the joint feature to obtain the second vector representation.
9. An electronic device, comprising a processor, a memory, and a program or instructions stored on the memory and executable on the processor, wherein the program or instructions, when executed by the processor, implement the steps of the target image recognition method according to any one of claims 1 to 6.
10. A readable storage medium, storing thereon a program or instructions which, when executed by a processor, implement the steps of the target image recognition method according to any one of claims 1 to 6.
CN202111621115.9A 2021-12-27 2021-12-27 Target image identification method and device, electronic equipment and readable storage medium Pending CN114419391A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111621115.9A CN114419391A (en) 2021-12-27 2021-12-27 Target image identification method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111621115.9A CN114419391A (en) 2021-12-27 2021-12-27 Target image identification method and device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN114419391A (en) 2022-04-29

Family

ID=81270301

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111621115.9A Pending CN114419391A (en) 2021-12-27 2021-12-27 Target image identification method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN114419391A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115578584A (en) * 2022-09-30 2023-01-06 北京百度网讯科技有限公司 Image processing method, and image processing model construction and training method
CN115578584B (en) * 2022-09-30 2023-08-29 北京百度网讯科技有限公司 Image processing method, image processing model construction and training method

Similar Documents

Publication Publication Date Title
Ahmed et al. Content based image retrieval using image features information fusion
CN110866140B (en) Image feature extraction model training method, image searching method and computer equipment
Zhai et al. The emerging "big dimensionality"
CN106919920B (en) Scene recognition method based on convolution characteristics and space vision bag-of-words model
Pan et al. Deepfood: Automatic multi-class classification of food ingredients using deep learning
US10949702B2 (en) System and a method for semantic level image retrieval
CN109325148A (en) The method and apparatus for generating information
CN107209860A (en) Optimize multiclass image classification using blocking characteristic
Beikmohammadi et al. SWP-LeafNET: A novel multistage approach for plant leaf identification based on deep CNN
Champ et al. A comparative study of fine-grained classification methods in the context of the LifeCLEF plant identification challenge 2015
CN108228844A (en) A kind of picture screening technique and device, storage medium, computer equipment
CN115443490A (en) Image auditing method and device, equipment and storage medium
Shang et al. Image spam classification based on convolutional neural network
CN112016450A (en) Training method and device of machine learning model and electronic equipment
CN110837570A (en) Method for unbiased classification of image data
CN111125457A (en) Deep cross-modal Hash retrieval method and device
CN111340213B (en) Neural network training method, electronic device, and storage medium
Celikkale et al. Predicting memorability of images using attention-driven spatial pooling and image semantics
CN112786160A (en) Multi-image input multi-label gastroscope image classification method based on graph neural network
CN115457332A (en) Image multi-label classification method based on graph convolution neural network and class activation mapping
Gao et al. Evaluation of regularized multi-task leaning algorithms for single/multi-view human action recognition
CN105389588A (en) Multi-semantic-codebook-based image feature representation method
Bouguila On multivariate binary data clustering and feature weighting
Siddiqi Fruit-classification model resilience under adversarial attack
CN114419391A (en) Target image identification method and device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination