CN114550236B - Training method, device, equipment and storage medium for image recognition and model thereof


Info

Publication number
CN114550236B
Authority
CN
China
Prior art keywords
feature map
image
global
feature
matrix
Prior art date
Legal status
Active
Application number
CN202210082039.7A
Other languages
Chinese (zh)
Other versions
CN114550236A (en)
Inventor
杨馥魁
温圣召
韩钧宇
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210082039.7A
Publication of CN114550236A
Application granted
Publication of CN114550236B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Abstract

The disclosure provides a training method, device, equipment and storage medium for image recognition and a model thereof, relates to the technical field of artificial intelligence, in particular to the technical field of deep learning and computer vision, and can be applied to scenes such as face recognition and face image recognition. The image recognition method comprises the following steps: carrying out feature extraction processing on an image to obtain local features of the image, wherein the local features are used for expressing features in an area of the image; acquiring global features of the image, wherein the global features are used for expressing inter-region features of the image; and acquiring an image recognition result of the image based on the local feature and the global feature. The present disclosure can improve image recognition accuracy.

Description

Training method, device, equipment and storage medium for image recognition and model thereof
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to the technical field of deep learning and computer vision, and can be applied to scenes such as face recognition, face image recognition and the like, and particularly relates to a training method, device, equipment and storage medium for image recognition and a model thereof.
Background
Face recognition is a biological recognition technology for carrying out identity recognition based on facial feature information of people. Generally, a face recognition model may be used to recognize an input face image to obtain a face recognition result.
Disclosure of Invention
The present disclosure provides a training method, apparatus, device and storage medium for image recognition and model thereof.
According to an aspect of the present disclosure, there is provided an image recognition method including: carrying out feature extraction processing on an image to obtain local features of the image, wherein the local features are used for expressing features in an area of the image; acquiring global features of the image, wherein the global features are used for expressing inter-region features of the image; and acquiring an image recognition result of the image based on the local feature and the global feature.
According to another aspect of the present disclosure, there is provided a training method of an image recognition model, including: performing feature extraction processing on an input image sample by adopting an initial image recognition model to obtain local features of the image sample, wherein the local features are used for expressing features in a region of the image sample; acquiring global features of the image sample, wherein the global features are used for expressing inter-region features of the image sample; based on the local features and the global features, obtaining a prediction recognition result of the image sample; constructing a loss function based on the predicted recognition result and a real recognition result corresponding to the image sample; based on the loss function, parameters of the initial image recognition model are adjusted to generate a final image recognition model.
According to another aspect of the present disclosure, there is provided an image recognition apparatus including: the first acquisition module is used for carrying out feature extraction processing on the image so as to acquire local features of the image, wherein the local features are used for expressing the features in the region of the image; the second acquisition module is used for acquiring global features of the image, wherein the global features are used for expressing inter-region features of the image; and the identification module is used for acquiring an image identification result of the image based on the local feature and the global feature.
According to another aspect of the present disclosure, there is provided a training apparatus of an image recognition model, including: the first acquisition module is used for carrying out feature extraction processing on an input image sample by adopting an initial image recognition model so as to acquire local features of the image sample, wherein the local features are used for expressing the features in the region of the image sample; the second acquisition module is used for acquiring global features of the image sample, wherein the global features are used for expressing inter-region features of the image sample; the prediction module is used for acquiring a prediction recognition result of the image sample based on the local feature and the global feature; the construction module is used for constructing a loss function based on the prediction recognition result and the real recognition result corresponding to the image sample; and the generation module is used for adjusting parameters of the initial image recognition model based on the loss function so as to generate a final image recognition model.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the above aspects.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method according to any one of the above aspects.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method according to any of the above aspects.
According to the technical scheme, the image recognition accuracy can be improved.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;
FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure;
FIG. 3 is a schematic diagram according to a third embodiment of the present disclosure;
FIG. 4 is a schematic diagram according to a fourth embodiment of the present disclosure;
FIG. 5 is a schematic diagram according to a fifth embodiment of the present disclosure;
FIG. 6 is a schematic diagram according to a sixth embodiment of the present disclosure;
FIG. 7 is a schematic diagram according to a seventh embodiment of the present disclosure;
FIG. 8 is a schematic diagram according to an eighth embodiment of the present disclosure;
fig. 9 is a schematic diagram of an electronic device used to implement an image recognition method or training method of an image recognition model of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a schematic diagram according to a first embodiment of the present disclosure. The embodiment provides an image recognition method, which includes:
101. And carrying out feature extraction processing on the image to obtain local features of the image, wherein the local features are used for expressing the intra-region features of each region of the image.
102. And acquiring global features of the image, wherein the global features are used for expressing the inter-region features among the regions.
103. And acquiring an image recognition result of the image based on the local feature and the global feature.
The execution body of the embodiment may be an image recognition device, where the image recognition device may be located in an electronic device, the electronic device may be a user terminal or a server, and the user terminal may include: personal computer (Personal Computer, PC), mobile device, smart home device, wearable device, etc., mobile device includes cell phone, portable computer, tablet computer, etc., smart home device includes smart speaker, smart television, etc., and wearable device includes smart watch, smart glasses, etc. The server may be a local server or a cloud server, etc.
Specifically, taking image recognition as face recognition as an example, as shown in fig. 2, a user may install an Application (APP) 201 capable of performing face recognition on a mobile device (such as a mobile phone), where the APP 201 may collect a face image through a face collecting device (such as a camera) on the mobile device, and then if the mobile device itself has face recognition capability, the processor of the mobile device may perform face recognition on the collected face image to obtain a face recognition result. Alternatively, APP 201 may send the face image to server 202, and server 202 may be located in a local server or cloud. The APP 201 and the server 202 may use a communication network for data transmission. The server 202 may perform face recognition on the received face image to obtain a face recognition result, and feed back the face recognition result to the APP 201.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other handling of the personal information of the user all comply with the relevant laws and regulations and do not violate public order and good customs.
In the image recognition, feature extraction processing may be performed on an image to obtain an image feature of the image, where the image feature may specifically be a feature map. For the sake of distinction, since this processing operates within each region of the image rather than across regions, the features extracted here can be regarded as intra-region features and may be referred to as local features. When the feature is a feature map, the local feature may be referred to as a local feature map. Correspondingly, a feature that can express the inter-region relationships among different regions of the image may be referred to as a global feature.
Taking the image as a face image and the local feature as a local feature map as an example, as shown in fig. 2, the face image can be input into a convolutional neural network (Convolutional Neural Networks, CNN), the convolutional neural network performs feature extraction processing on the face image, and the output of the convolutional neural network is the local feature map corresponding to the input face image. The network structure of the CNN may be, for example, VGG or ResNet.
Fig. 2 takes face recognition at the server as an example; it is understood that face recognition may also be performed locally at the user terminal. The CNN used by the server to extract the local feature map may be referred to as a face recognition model, so that the local feature map may be obtained by using the face recognition model, and the local feature map is then subjected to subsequent processing to obtain a face recognition result.
In the related art, after the local feature map of a face image is obtained, the face recognition result is generally obtained directly from the local feature map. That is, the image features are only local features and reflect only local information, so the accuracy of such a face recognition method is insufficient.
In this embodiment, not only the local feature of the image but also the global feature of the image is obtained, where the global feature is used to express the inter-region feature of the image. Because the local features only consider the features in the region, the local features can be considered as local information, the global features consider the features between the regions and can be considered as global information, the image features of the embodiment comprise the local features and the global features, reflect the local information and the global information of the image to be identified, improve the expression capability of the image features, and further improve the image identification precision.
Fig. 3 is a schematic diagram according to a third embodiment of the present disclosure. The embodiment provides an image recognition method, taking face recognition as an example, and adopting a face recognition model to obtain a local feature map of a face image as an example, the method of the embodiment includes:
301. and adopting a face recognition model to perform feature extraction processing on the input face image so as to output a local feature map of the face image.
As shown in fig. 4, a CNN model may be used to perform feature extraction processing on a face image, so as to obtain a local feature map of the face image.
The dimensions of the local feature map F are (W, H, C).
Wherein W, H, and C are positive integers; C is the number of channels, for example, 128; W and H are the numbers of pixels included in the width and height of the local feature map on each channel, for example, W = H = 112, i.e., on each channel, the width and height of the local feature map are each 112 pixels.
The image recognition method provided by the embodiment of the disclosure can be applied to various image recognition scenes, such as face recognition scenes, animal and plant species recognition, and the like.
Taking a face recognition scene as an example, the image is a face image, the feature extraction processing is performed on the image to obtain a local feature map of the image, including: and adopting a face recognition model to perform feature extraction processing on the input face image so as to output a local feature map of the face image.
That is, when the local feature map is acquired, a face recognition model may be adopted, and the face recognition model may be a CNN model, for example, VGG or ResNet.
Because the face recognition model is usually a deep neural network model, such as a CNN model, which has the advantage of high accuracy, a more accurate local feature map can be extracted.
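As an illustrative, non-limiting sketch of step 301, the following PyTorch code extracts a local feature map from a face image with a small convolutional backbone. The backbone here is a placeholder assumption standing in for a full VGG or ResNet, and PyTorch lays tensors out as (batch, C, H, W) rather than the (W, H, C) notation used above:

    import torch
    import torch.nn as nn

    class SimpleBackbone(nn.Module):
        """Placeholder CNN feature extractor standing in for VGG/ResNet."""
        def __init__(self, channels: int = 128):
            super().__init__()
            # Two resolution-preserving convolutions; a real face recognition
            # model would use a much deeper network.
            self.body = nn.Sequential(
                nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(64, channels, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.body(x)

    face = torch.randn(1, 3, 112, 112)            # one 112x112 RGB face image
    local_feature_map = SimpleBackbone()(face)    # shape (1, 128, 112, 112)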
302. And acquiring a block feature map based on the local feature map.
The local feature map may be subjected to region blocking processing to obtain a plurality of image blocks; based on the plurality of image blocks, a block feature map is determined, the block feature map comprising the plurality of image blocks.
Referring to fig. 4, after the local feature map F is acquired, the local feature map F may be subjected to region blocking processing. The size of each image block may be preset; assuming that each image block has size w×h, where w and h are positive integers, the number of image blocks is N = (W // w) * (H // h).
Here w and h are preset values, for example, w = h = 2; * denotes multiplication and // denotes integer division. When w and h are set, values that divide W and H evenly can be selected.
Assuming W = H = 112 and w = h = 2, then N = 56 * 56 image blocks can be divided.
After acquiring the plurality of image blocks, the block feature map T may be acquired based on the plurality of image blocks.
Referring to fig. 4, the dimension of the block feature map T is (N, P, C), where N is the number of image blocks, P is the number of pixels of each image block, i.e., P = w * h, and C is the number of channels. Based on the above example, N = 56 * 56, P = 2 * 2, C = 128.
When the block feature map is acquired based on the local feature map, the processing can be performed per channel, so that the local feature map and the block feature map have the same number of channels. For each channel, the W * H local feature map may be divided into N = (W // w) * (H // h) image blocks of size w×h, each image block containing w * h pixel points. For each image block, the w * h pixel points may be converted into a (w * h)-dimensional vector; for example, each image block is a matrix of h rows and w columns, and the rows may be arranged sequentially from the first row to the h-th row, each row containing w elements, to obtain the P-dimensional vector corresponding to the image block. The N * P arrays of the C channels then constitute the block feature map with dimensions (N, P, C).
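A minimal sketch of the region blocking of step 302, assuming PyTorch tensors in (C, H, W) layout; the helper name to_block_feature_map is illustrative and not part of the disclosure:

    import torch

    def to_block_feature_map(f: torch.Tensor, w: int = 2, h: int = 2) -> torch.Tensor:
        """Cut a (C, H, W) local feature map into non-overlapping w x h blocks
        and flatten each block's pixels, giving a (N, P, C) block feature map."""
        C, H, W = f.shape
        assert H % h == 0 and W % w == 0, "w and h should divide W and H evenly"
        # (C, H//h, h, W//w, w) -> (H//h, W//w, h, w, C) -> (N, P, C)
        blocks = f.reshape(C, H // h, h, W // w, w).permute(1, 3, 2, 4, 0)
        return blocks.reshape((H // h) * (W // w), h * w, C)

    f_local = torch.randn(128, 112, 112)    # local feature map, C=128, W=H=112
    T = to_block_feature_map(f_local)       # shape (3136, 4, 128): N=56*56, P=2*2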
303. And acquiring a region similarity matrix based on the block feature map.
The region similarity matrix is used for indicating the similarity between the plurality of image blocks, namely the inter-region similarity between the regions. As shown in fig. 4, the dimension of the region similarity matrix M is (N, N).
Wherein, based on the block feature map, determining the region similarity matrix may include:
performing first shape conversion (reshape) processing on the block feature map to obtain a first matrix; performing second shape conversion processing on the block feature map to obtain a second matrix; and taking the product of the first matrix and the second matrix as the region similarity matrix, where the number of rows of the first matrix and the number of columns of the second matrix are both the number of the plurality of image blocks.
Specifically, the first matrix may be expressed as: reshape(T, [N, P*C]);
the second matrix may be expressed as: reshape(T, [P*C, N]);
The reshape(T, [N, P*C]) is used to convert the shape of the block feature map T with dimensions (N, P, C) into a matrix with dimensions (N, P*C), i.e., a matrix with N rows and P*C columns. Regarding the shape conversion, for example, for each row of T, the element values of the pixel points of each channel may be spliced to obtain the P*C elements of the corresponding row of the first matrix; after this is done row by row, a first matrix with N rows and P*C columns is obtained. The second matrix is acquired similarly, yielding a second matrix with P*C rows and N columns.
After the first matrix and the second matrix are acquired, the product of the first matrix and the second matrix can be used as a region similarity matrix. Expressed by the formula:
M=reshape(T,[N,P*C])*reshape(T,[P*C,N])。
Through the two shape conversion processes on the block feature map, a first matrix with N rows and a second matrix with N columns can be obtained respectively, where N is the number of image blocks. Therefore, the region similarity between every two of the plurality of image blocks can be calculated based on the first matrix and the second matrix, and these pairwise region similarities form the region similarity matrix, which can express the relationships between different regions.
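Under the same assumptions, the region similarity matrix of step 303 reduces to two reshapes and a matrix product. In the sketch below the second matrix is realized as the transpose of the first, which is one plausible reading of reshape(T, [P*C, N]) that keeps one block vector per column:

    import torch

    N, P, C = 3136, 4, 128          # N=56*56 blocks, P=2*2 pixels, C channels
    T = torch.randn(N, P, C)        # block feature map from the previous step

    first = T.reshape(N, P * C)     # reshape(T, [N, P*C]): one row per image block
    second = first.T                # reshape(T, [P*C, N]): one column per image block
    M = first @ second              # (N, N) region similarity matrix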
304. And acquiring a weighted feature map based on the region similarity matrix and the block feature map.
After the region similarity matrix is obtained, as shown in fig. 4, the region similarity matrix and the block feature map may be weighted, so as to obtain a weighted feature map B, where the dimension of the weighted feature map B is identical to the dimension of the block feature map T, that is, the dimension of the weighted feature map B is (N, P, C).
Specifically, the region similarity matrix may be subjected to normalization processing to obtain a normalized region similarity matrix; and taking the product of the normalized area similarity matrix and the block feature map as a weighted feature map.
Wherein, the normalization can adopt a softmax function, and the calculation formula of the weighted feature map can be expressed as follows:
B=∑softmax(M)*T
wherein M is a region similarity matrix, T is a block feature map, and B is a weighted feature map.
softmax is a softmax function and Σ is a sum function.
When the normalized M is an N*N matrix, for each image block, that is, for each row (or each column) of the normalized M, the element values of that row may be summed and used as the weighting coefficient of the image block corresponding to the row. When multiplying, for each channel of T, each of the N rows (P elements per row) may be multiplied by its row-summed weighting coefficient, so that B with dimensions (N, P, C) is obtained.
Because M contains similarity information among different areas, B acquired based on M contains information among the areas, namely B contains rich global information, and the connection among the different areas is established.
By acquiring the weighted feature map based on the region similarity matrix and the block feature map, the weighted feature map containing global information can be acquired, and thus the global feature map can be acquired based on the weighted feature map.
305. And acquiring a global feature map based on the weighted feature map.
The weighted feature map may be subjected to a shape conversion (reshape) process, and the weighted feature map after the shape conversion process is used as a global feature map, that is, the weighted feature map B is converted into a feature map having dimensions matching those of the local feature map, and the feature map is used as the global feature map.
The dimensions of the global feature map are (W, H, C).
The global feature map is obtained by performing shape conversion processing on the weighted feature map, and the weighted feature map contains rich global information, so that the global feature map also contains rich global information, the expression capability of image features can be improved, and the image recognition precision can be improved.
In addition, the global feature map is obtained based on the local feature map, specifically, the global feature map is obtained based on the regional segmentation of the local feature map, and the global feature map can be accurately and efficiently obtained.
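A sketch of steps 304 and 305 under the same assumptions. It applies the formula B = softmax(M) * T as a plain matrix product over the flattened block features, which is one plausible reading of the weighting described above, and then inverts the blocking of step 302 so the global feature map matches the layout of the local feature map:

    import torch
    import torch.nn.functional as Fn

    N, P, C, H, W, h, w = 3136, 4, 128, 112, 112, 2, 2
    T = torch.randn(N, P, C)                 # block feature map
    M = torch.randn(N, N)                    # region similarity matrix

    weights = Fn.softmax(M, dim=1)           # normalized similarities, rows sum to 1
    B = (weights @ T.reshape(N, P * C)).reshape(N, P, C)   # weighted feature map

    # Step 305: undo the region blocking so the global feature map has the
    # same (C, H, W) layout as the local feature map.
    G = (B.reshape(H // h, W // w, h, w, C)
          .permute(4, 0, 2, 1, 3)
          .reshape(C, H, W))                 # global feature map, (128, 112, 112)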
306. And acquiring a fusion feature map based on the local feature map and the global feature map.
The local feature map and the global feature map may be combined according to a channel dimension, and the combined feature map is used as the fusion feature map.
For example, referring to fig. 4, the dimensions of the local feature map and the global feature map are both (W, H, C), and after merging along the channel dimension, a fusion feature map A with dimensions (W, H, 2*C) may be obtained.
The fused feature map may then be used as a final image feature, based on which image recognition may be performed.
Compared with the scheme that the common image features only comprise local features, in the embodiment, the image features are fused with the local features and the global features, so that the expression capability of the image features can be improved, and the image recognition precision is improved.
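A one-line sketch of the fusion in step 306, assuming the (C, H, W) tensors from the previous sketches:

    import torch

    f_local = torch.randn(128, 112, 112)       # local feature map
    g_global = torch.randn(128, 112, 112)      # global feature map

    A = torch.cat([f_local, g_global], dim=0)  # fusion feature map, (256, 112, 112)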
307. And acquiring a face recognition result of the face image based on the fusion feature map.
After the fusion feature map corresponding to the face image is obtained, the fusion feature map can be converted into a feature vector to be identified; determining the vector similarity between the feature vector to be identified and each candidate feature vector in the pre-stored candidate feature vectors of a plurality of users; and determining the user to which the face image belongs based on the vector similarity.
As shown in fig. 4, the fusion feature map A has dimensions (W, H, 2*C) and may be flattened into a vector of dimension W*H*2*C, which may be referred to as the feature vector to be identified.
The flattening may be implemented by various related techniques; for example, for each channel, the W*H values may be arranged row by row into a (W*H)-dimensional vector, and the vectors corresponding to all channels may then be concatenated into the final feature vector to be identified.
The feature vectors of each of the plurality of users may be pre-stored in a database, and the feature vector of each user may be referred to as a candidate feature vector; for example, the database may store candidate feature vectors of 10,000 users.
The vector similarity between the feature vector to be identified and each candidate feature vector may be calculated, where a calculation manner of the vector similarity in various related technologies may be adopted, for example, a cosine similarity between the vectors is calculated, and so on. Then, a face recognition result may be obtained based on the vector similarity, for example, a user corresponding to the candidate feature vector with the highest vector similarity is used as the user to which the recognized face image belongs.
By determining the image recognition result based on the similarity between vectors, the image recognition result can be simply and quickly obtained.
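A sketch of the recognition in step 307 under the same assumptions. The candidate database below is random placeholder data standing in for the pre-stored candidate feature vectors of enrolled users, cosine similarity is used as the vector similarity, and a small fused map stands in for the full (W, H, 2*C) tensor to keep the example light:

    import torch
    import torch.nn.functional as Fn

    A = torch.randn(256, 8, 8)                    # small stand-in for the fusion feature map
    query = A.flatten()                           # feature vector to be identified

    candidates = torch.randn(1_000, query.numel())              # e.g. 1,000 enrolled users
    sims = Fn.cosine_similarity(candidates, query.unsqueeze(0)) # (1000,) vector similarities
    recognized_user = int(sims.argmax())          # user with the highest similarity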
In this embodiment, a local feature map of a face image is obtained by using a face recognition model, a regional block processing is performed on the local feature map to obtain a block feature map, a global feature map is obtained based on the block feature map, fusion processing is performed on the local feature map and the global feature map to obtain a fusion feature map, and a face recognition result is obtained based on the fusion feature map, so that local information and global information of the face image can be combined in a face recognition process, the expression capability of image features of the face image is improved, and further the face recognition accuracy is improved.
The above describes the image recognition procedure, in which, as noted, a local feature map may be acquired using an image recognition model. For the training process of the image recognition model, see the following embodiments.
Fig. 5 is a schematic diagram according to a fifth embodiment of the present disclosure. The embodiment provides a training method of an image recognition model, which comprises the following steps:
501. and carrying out feature extraction processing on an input image sample by adopting an initial image recognition model to obtain local features of the image sample, wherein the local features are used for expressing the features in the region of the image sample.
502. Global features of the image sample are obtained, the global features being used to express inter-region features of the image sample.
503. And acquiring a prediction recognition result of the image sample based on the local feature and the global feature.
504. And constructing a loss function based on the predicted recognition result and the real recognition result corresponding to the image sample.
505. Based on the loss function, parameters of the initial image recognition model are adjusted to generate a final image recognition model.
The execution body of the embodiment may be a training device of an image recognition model, the training device of the image recognition model may be located in an electronic device, the electronic device may be a user terminal or a server, and the user terminal may include: personal computer (Personal Computer, PC), mobile device, smart home device, wearable device, etc., mobile device includes cell phone, portable computer, tablet computer, etc., smart home device includes smart speaker, smart television, etc., and wearable device includes smart watch, smart glasses, etc. The server may be a local server or a cloud server, etc.
The images used in the model training phase may be referred to as image samples, which may be obtained from an existing dataset, such as ImageNet.
In addition, manual labeling or other modes may be adopted in advance to label the image recognition result of the image sample, for example, the user identifier of the user to which the labeled image sample belongs, and the image recognition result labeled in advance may be referred to as a real recognition result corresponding to the image sample.
Similar to the model application stage (i.e., the image recognition process described above), image recognition results, which may be referred to as predictive recognition results, may be obtained after processing the image samples using the image recognition model.
The model parameters of the initial image recognition model may be manually set, or the initial image recognition model may employ an existing pre-trained model.
The procedure from the image sample to the predicted recognition result may include various embodiments; for the specific content, reference may be made to the above-described embodiments of the image recognition procedure, which will not be described in detail herein.
In this embodiment, the prediction recognition result may be obtained based on the global feature and the local feature, so that not only the local information of the image sample but also the global information of the image sample may be considered, and therefore, the accuracy of the prediction recognition result may be improved, and further, the accuracy of the image recognition model may be improved.
In some embodiments, the local feature is a local feature map, the global feature is a global feature map, and the acquiring the global feature of the image includes: performing regional blocking processing on the local feature map to obtain a plurality of image blocks; determining a block feature map based on the plurality of image blocks, the block feature map comprising the plurality of image blocks; determining a region similarity matrix based on the block feature map, wherein the region similarity matrix is used for indicating the similarity among the plurality of image blocks; and acquiring the global feature map based on the region similarity matrix and the block feature map.
The global feature map is obtained based on the local feature map, specifically, the global feature map is obtained based on the regional segmentation of the local feature map, and the global feature map can be accurately and efficiently obtained. In addition, since the region similarity matrix can express the inter-region information, a global feature map capable of expressing the inter-region information can be obtained based on the region similarity matrix.
In some embodiments, the determining the region similarity matrix based on the block feature map includes: performing first shape conversion processing on the block feature map to obtain a first matrix; performing second shape conversion processing on the block feature map to obtain a second matrix; and taking the product of the first matrix and the second matrix as the region similarity matrix, where the number of rows of the first matrix and the number of columns of the second matrix are both the number of the plurality of image blocks.
By performing two shape conversion processes on the block feature map, a first matrix with N rows and a second matrix with N columns can be obtained respectively, where N is the number of image blocks, so that the region similarity between every two of the plurality of image blocks can be calculated based on the first matrix and the second matrix, and these pairwise region similarities form the region similarity matrix.
In some embodiments, the obtaining the global feature map based on the region similarity matrix and the block feature map includes: normalizing the region similarity matrix to obtain a normalized region similarity matrix; taking the product of the normalized area similarity matrix and the block feature map as a weighted feature map; and carrying out shape conversion processing on the weighted feature map to obtain the global feature map, wherein the dimension of the global feature map is consistent with the dimension of the local feature map.
By acquiring the weighted feature map based on the region similarity matrix and the block feature map, the weighted feature map containing global information can be acquired. The global feature map is obtained by performing shape conversion processing on the weighted feature map, and the weighted feature map contains rich global information, so that the global feature map also contains rich global information, the expression capability of image features can be improved, and the image recognition precision can be improved.
In some embodiments, the local feature is a local feature map, the global feature is a global feature map, and the obtaining a predicted recognition result of the image sample based on the local feature and the global feature includes: carrying out fusion processing on the local feature map and the global feature map to obtain a fusion feature map; and acquiring a prediction recognition result of the image sample based on the fusion feature map.
Compared with the scheme that the common image features only comprise local features, in the embodiment, the image features are fused with the local features and the global features, so that the expression capability of the image features can be improved, and the image recognition precision is improved.
In some embodiments, the image sample is a face image sample, and the obtaining, based on the fusion feature map, a predicted recognition result of the image sample includes: converting the fusion feature map into feature vectors to be identified; determining the vector similarity between the feature vector to be identified and each candidate feature vector in the pre-stored candidate feature vectors of a plurality of users; and determining user information of a user to which the face image sample belongs based on the vector similarity, and taking the user information as the prediction recognition result.
The user information may be a user identifier, or a probability value indicating whether the user is the correct recognition result, or the like. Taking the probability value as an example, the real recognition result indicates whether the user is the correct recognition result and may be 1 (the correct recognition result) or 0 (not the correct recognition result), while the predicted recognition result is typically a value between 0 and 1.
By determining the predictive recognition result based on the similarity between vectors, the predictive recognition result can be simply and quickly obtained.
After the predicted recognition result and the real recognition result are obtained, a loss function can be constructed.
For example, referring to fig. 6, after the face image sample is subjected to convolution, region blocking and other processes, a fusion feature map may be obtained, a recognition result may be predicted based on the fusion feature map, and a loss function may be constructed based on the predicted recognition result and the real recognition result, where the loss function is, for example, a classification loss function. The classification loss function may specifically be a cross entropy function.
After the loss function is acquired, the model parameters of the image recognition model may be adjusted based on the loss function, where the adjustment of the model parameters may be performed using a Back Propagation (BP) algorithm or the like. The maximum iteration number can be set in the model parameter adjustment process, and when the adjustment number of the model parameters reaches the maximum iteration number, the model corresponding to the model parameters at the moment is used as a final image recognition model.
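A minimal sketch of the training procedure of steps 501 to 505, assuming a cross-entropy classification loss and stochastic gradient descent as suggested above. The stand-in model and the random images and labels are illustrative placeholders, not the disclosed architecture:

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 112 * 112, 1000))  # stand-in recognizer
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()   # classification loss (cross entropy)
    max_iterations = 100              # maximum iteration number (illustrative)

    for step in range(max_iterations):
        images = torch.randn(32, 3, 112, 112)    # a batch of face image samples
        labels = torch.randint(0, 1000, (32,))   # real recognition results
        logits = model(images)                   # predicted recognition results
        loss = loss_fn(logits, labels)           # step 504: construct the loss
        optimizer.zero_grad()
        loss.backward()                          # back propagation (step 505)
        optimizer.step()                         # adjust the model parameters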
Fig. 7 is a schematic diagram of a seventh embodiment of the present disclosure, and the present embodiment provides an image recognition apparatus 700, including: a first acquisition module 701, a second acquisition module 702, and an identification module 703.
The first obtaining module 701 is configured to perform feature extraction processing on an image to obtain local features of the image, where the local features are used to express features in an area of the image; the second obtaining module 702 is configured to obtain global features of the image, where the global features are used to express inter-region features of the image; the recognition module 703 is configured to obtain an image recognition result of the image based on the local feature and the global feature.
In this embodiment, not only the local feature of the image but also the global feature of the image is obtained, where the global feature is used to express the inter-region feature of the image. Because the local features only consider the features in the region, the local features can be considered as local information, the global features consider the features between the regions and can be considered as global information, the image features of the embodiment comprise the local features and the global features, reflect the local information and the global information of the image to be identified, improve the expression capability of the image features, and further improve the image identification precision.
In some embodiments, the local feature is a local feature map and the global feature is a global feature map, and the second obtaining module 702 is further configured to: performing regional blocking processing on the local feature map to obtain a plurality of image blocks; determining a block feature map based on the plurality of image blocks, the block feature map comprising the plurality of image blocks; determining a region similarity matrix based on the block feature map, wherein the region similarity matrix is used for indicating the similarity among the plurality of image blocks; and acquiring the global feature map based on the region similarity matrix and the block feature map.
Because the region similarity matrix can express the inter-region information, a global feature map capable of expressing the inter-region information can be obtained based on the region similarity matrix.
In some embodiments, the second obtaining module 702 is further configured to: performing first shape conversion processing on the block feature map to obtain a first matrix; performing second shape conversion processing on the block feature map to obtain a second matrix; and taking the product of the first matrix and the second matrix as the region similarity matrix, where the number of rows of the first matrix and the number of columns of the second matrix are both the number of the plurality of image blocks.
Through the two shape conversion processes on the block feature map, a first matrix with N rows and a second matrix with N columns can be obtained respectively, where N is the number of image blocks, so that the region similarity between every two of the plurality of image blocks can be calculated based on the first matrix and the second matrix; these pairwise region similarities form the region similarity matrix, which can express the relationships between different regions.
In some embodiments, the second obtaining module 702 is further configured to: normalizing the region similarity matrix to obtain a normalized region similarity matrix; taking the product of the normalized area similarity matrix and the block feature map as a weighted feature map; and carrying out shape conversion processing on the weighted feature map to obtain the global feature map, wherein the dimension of the global feature map is consistent with the dimension of the local feature map.
By acquiring the weighted feature map based on the region similarity matrix and the block feature map, the weighted feature map containing global information can be acquired. The global feature map is obtained by performing shape conversion processing on the weighted feature map, and the weighted feature map contains rich global information, so that the global feature map also contains rich global information, the expression capability of image features can be improved, and the image recognition precision can be improved.
In some embodiments, the local feature is a local feature map and the global feature is a global feature map, and the identifying module 703 is further configured to: carrying out fusion processing on the local feature map and the global feature map to obtain a fusion feature map; and acquiring an image recognition result of the image based on the fusion feature map.
Compared with the scheme that the common image features only comprise local features, in the embodiment, the image features are fused with the local features and the global features, so that the expression capability of the image features can be improved, and the image recognition precision is improved.
In some embodiments, the image is a face image, and the recognition module 703 is further configured to: converting the fusion feature map into feature vectors to be identified; determining the vector similarity between the feature vector to be identified and each candidate feature vector in the pre-stored candidate feature vectors of a plurality of users; and determining the user to which the face image belongs based on the vector similarity.
By determining the image recognition result based on the similarity between vectors, the image recognition result can be simply and quickly obtained.
In some embodiments, the first obtaining module 701 is further configured to: and adopting an image recognition model to perform feature extraction processing on the input image so as to output local features of the image.
Because the face recognition model is usually a deep neural network model, such as a CNN model, which has the advantage of high accuracy, more accurate local features can be extracted.
Fig. 8 is a schematic diagram of an eighth embodiment of the disclosure, where the embodiment provides a training apparatus for an image recognition model, and the apparatus 800 includes: a first acquisition module 801, a second acquisition module 802, a prediction module 803, a construction module 804, and a generation module 805.
The first obtaining module 801 is configured to perform feature extraction processing on an input image sample by using an initial image recognition model, so as to obtain local features of the image sample, where the local features are used to express features in an area of the image sample; the second obtaining module 802 is configured to obtain global features of the image sample, where the global features are used to express inter-region features of the image sample; the prediction module 803 is configured to obtain a predicted recognition result of the image sample based on the local feature and the global feature; the construction module 804 is configured to construct a loss function based on the predicted recognition result and the real recognition result corresponding to the image sample; the generating module 805 is configured to adjust parameters of the initial image recognition model based on the loss function to generate a final image recognition model.
In this embodiment, the prediction recognition result may be obtained based on the global feature and the local feature, so that not only the local information of the image sample but also the global information of the image sample may be considered, and therefore, the accuracy of the prediction recognition result may be improved, and further, the accuracy of the image recognition model may be improved.
In some embodiments, the local feature is a local feature map and the global feature is a global feature map, and the second obtaining module 802 is further configured to: performing regional blocking processing on the local feature map to obtain a plurality of image blocks; determining a block feature map based on the plurality of image blocks, the block feature map comprising the plurality of image blocks; determining a region similarity matrix based on the block feature map, wherein the region similarity matrix is used for indicating the similarity among the plurality of image blocks; and acquiring the global feature map based on the region similarity matrix and the block feature map.
Because the region similarity matrix can express the inter-region information, a global feature map capable of expressing the inter-region information can be obtained based on the region similarity matrix.
In some embodiments, the second obtaining module 802 is further configured to: performing first shape conversion processing on the block feature map to obtain a first matrix; performing second shape conversion processing on the block feature map to obtain a second matrix; and taking the product of the first matrix and the second matrix as the region similarity matrix, where the number of rows of the first matrix and the number of columns of the second matrix are both the number of the plurality of image blocks.
By performing two shape conversion processes on the block feature map, a first matrix with N rows and a second matrix with N columns can be obtained respectively, where N is the number of image blocks, so that the region similarity between every two of the plurality of image blocks can be calculated based on the first matrix and the second matrix, and these pairwise region similarities form the region similarity matrix.
In some embodiments, the second obtaining module 802 is further configured to: normalizing the region similarity matrix to obtain a normalized region similarity matrix; taking the product of the normalized area similarity matrix and the block feature map as a weighted feature map; and carrying out shape conversion processing on the weighted feature map to obtain the global feature map, wherein the dimension of the global feature map is consistent with the dimension of the local feature map.
By acquiring the weighted feature map based on the region similarity matrix and the block feature map, the weighted feature map containing global information can be acquired. The global feature map is obtained by performing shape conversion processing on the weighted feature map, and the weighted feature map contains rich global information, so that the global feature map also contains rich global information, the expression capability of image features can be improved, and the image recognition precision can be improved.
In some embodiments, the local feature is a local feature map and the global feature is a global feature map, and the prediction module 803 is further configured to: carrying out fusion processing on the local feature map and the global feature map to obtain a fusion feature map; and acquiring a prediction recognition result of the image sample based on the fusion feature map.
Compared with the scheme that the common image features only comprise local features, in the embodiment, the image features are fused with the local features and the global features, so that the expression capability of the image features can be improved, and the image recognition precision is improved.
In some embodiments, the image sample is a face image sample, and the prediction module 803 is further configured to: converting the fusion feature map into feature vectors to be identified; determining the vector similarity between the feature vector to be identified and each candidate feature vector in the pre-stored candidate feature vectors of a plurality of users; and determining user information of a user to which the face image sample belongs based on the vector similarity, and taking the user information as the prediction recognition result.
By determining the predictive recognition result based on the similarity between vectors, the predictive recognition result can be simply and quickly obtained.
It can be understood that "first", "second", etc. in the embodiments of the present disclosure are only used for distinguishing, and do not indicate the importance level, the time sequence, etc.
It is to be understood that in the embodiments of the disclosure, the same or similar content in different embodiments may be referred to each other.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other handling of the personal information of the user all comply with the relevant laws and regulations and do not violate public order and good customs.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 9 shows a schematic block diagram of an example electronic device 900 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile apparatuses, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing apparatuses. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the electronic device 900 includes a computing unit 901 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the electronic device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other by a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
A number of components in the electronic device 900 are connected to the I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, or the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, an optical disk, or the like; and a communication unit 909 such as a network card, modem, wireless communication transceiver, or the like. The communication unit 909 allows the electronic device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunications networks.
The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 901 performs the respective methods and processes described above, for example, an image recognition method or a training method of an image recognition model. For example, in some embodiments, the image recognition method or the training method of the image recognition model may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the image recognition method or the training method of the image recognition model described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the image recognition method or the training method of the image recognition model in any other suitable way (e.g. by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chips (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the drawbacks of high management difficulty and weak service scalability found in traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system or a server that incorporates a blockchain.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions of the present disclosure can be achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (20)

1. An image recognition method, comprising:
carrying out feature extraction processing on an image to obtain local features of the image, wherein the local features are used for expressing features in an area of the image;
acquiring global features of the image, wherein the global features are used for expressing inter-region features of the image;
acquiring an image recognition result of the image based on the local feature and the global feature;
wherein the local feature is a local feature map, the global feature is a global feature map, and the acquiring the global feature of the image includes:
performing regional blocking processing on the local feature map to obtain a plurality of image blocks;
determining a block feature map based on the plurality of image blocks, the block feature map comprising the plurality of image blocks;
determining a region similarity matrix based on the block feature map, wherein the region similarity matrix is used for indicating the similarity among the plurality of image blocks; wherein, based on the block feature map, determining a region similarity matrix includes: performing first shape conversion processing on the block feature map to obtain a first matrix; performing second shape conversion processing on the block feature map to obtain a second matrix; taking the product of the first matrix and the second matrix as the region similarity matrix; the number of rows of the first matrix and the number of columns of the second matrix are the number of the plurality of image blocks;
acquiring the global feature map based on the region similarity matrix and the block feature map; the dimension of the global feature map is consistent with the dimension of the local feature map, the global feature map is obtained by performing shape conversion on a weighted feature map, and the weighted feature map is obtained by performing weighted calculation on the block feature map and the region similarity matrix.
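For illustration, the global-feature pipeline of claim 1 can be sketched in a few lines of NumPy. This is a minimal sketch, not the patented implementation: the square block layout, the default block size, and all function and variable names are assumptions introduced here for readability.

```python
import numpy as np

def global_feature_map(local_fmap: np.ndarray, block: int = 4) -> np.ndarray:
    # local_fmap: (C, H, W) local feature map; H and W are assumed to be
    # divisible by `block` (an illustrative simplification).
    C, H, W = local_fmap.shape
    bh, bw = H // block, W // block
    n = bh * bw                    # number of image blocks
    d = C * block * block          # features per block

    # Region blocking: cut the map into n non-overlapping square blocks
    # and stack them into a block feature map of shape (n, d).
    blocks = (local_fmap.reshape(C, bh, block, bw, block)
                        .transpose(1, 3, 0, 2, 4)
                        .reshape(n, d))

    # First and second shape conversions yield (n, d) and (d, n) views of
    # the block feature map; their product is the (n, n) region similarity
    # matrix, whose row and column counts both equal the block count.
    sim = blocks @ blocks.T

    # Weighted calculation of the block feature map by the similarity
    # matrix (claim 2 additionally normalizes `sim` first).
    weighted = sim @ blocks        # (n, d) weighted feature map

    # Shape conversion back, so that the global feature map's dimensions
    # are consistent with the local feature map's.
    return (weighted.reshape(bh, bw, C, block, block)
                    .transpose(2, 0, 3, 1, 4)
                    .reshape(C, H, W))
```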
2. The method according to claim 1, wherein the weighted feature map is obtained by performing weighted calculation on the block feature map and the region similarity matrix, and specifically includes:
normalizing the region similarity matrix to obtain a normalized region similarity matrix;
and taking the product of the normalized region similarity matrix and the block feature map as a weighted feature map.
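Claim 2 pins the weighting down as normalize-then-multiply. The claim does not name the normalization, so the row-wise softmax below is an assumption; `sim` and `blocks` are the (n, n) similarity matrix and (n, d) block feature map from the sketch above.

```python
import numpy as np

def weighted_feature_map(sim: np.ndarray, blocks: np.ndarray) -> np.ndarray:
    # Normalize the region similarity matrix row by row; softmax is an
    # assumed choice, since the claim only says "normalizing".
    norm = np.exp(sim - sim.max(axis=-1, keepdims=True))
    norm /= norm.sum(axis=-1, keepdims=True)
    # The product of the normalized similarity matrix and the block
    # feature map is the weighted feature map.
    return norm @ blocks
```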
3. The method of claim 1, wherein the local feature is a local feature map and the global feature is a global feature map, the acquiring an image recognition result of the image based on the local feature and the global feature comprising:
carrying out fusion processing on the local feature map and the global feature map to obtain a fusion feature map;
and acquiring an image recognition result of the image based on the fusion feature map.
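Claim 3 leaves the fusion operator open. Since claim 1 guarantees that the two maps share the same dimensions, element-wise addition is one natural choice, assumed in this sketch (channel concatenation would be an equally plausible reading).

```python
import numpy as np

def fuse(local_fmap: np.ndarray, global_fmap: np.ndarray) -> np.ndarray:
    # Element-wise addition of two same-shaped feature maps; this operator
    # is an assumption, not fixed by the claim.
    assert local_fmap.shape == global_fmap.shape
    return local_fmap + global_fmap
```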
4. A method according to claim 3, wherein the image is a face image, and the acquiring an image recognition result of the image based on the fused feature map includes:
converting the fusion feature map into feature vectors to be identified;
determining the vector similarity between the feature vector to be identified and each candidate feature vector in the pre-stored candidate feature vectors of a plurality of users;
and determining the user to which the face image belongs based on the vector similarity.
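The retrieval in claim 4 amounts to a nearest-neighbor search over pre-stored user vectors. A hedged sketch follows, assuming cosine similarity as the vector-similarity measure and simple flattening as the map-to-vector conversion; neither choice is fixed by the claim, and a learned embedding layer would be the more usual conversion in practice.

```python
import numpy as np

def identify_user(fused_fmap: np.ndarray, gallery: dict) -> str:
    # gallery: maps each user id to that user's pre-stored candidate
    # feature vector (a 1-D np.ndarray).
    query = fused_fmap.ravel()
    query = query / np.linalg.norm(query)

    def cosine(vec: np.ndarray) -> float:
        return float(query @ (vec / np.linalg.norm(vec)))

    # The face image is attributed to the user whose candidate vector is
    # most similar to the feature vector to be identified.
    return max(gallery, key=lambda uid: cosine(gallery[uid]))
```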
5. A method according to any one of claims 1-3, wherein the carrying out feature extraction processing on the image to obtain local features of the image includes:
adopting an image recognition model to perform feature extraction processing on the input image so as to output local features of the image.
6. A training method of an image recognition model, comprising:
performing feature extraction processing on an input image sample by adopting an initial image recognition model to obtain local features of the image sample, wherein the local features are used for expressing features in a region of the image sample;
acquiring global features of the image sample, wherein the global features are used for expressing inter-region features of the image sample;
based on the local features and the global features, obtaining a prediction recognition result of the image sample;
constructing a loss function based on the predicted recognition result and a real recognition result corresponding to the image sample;
based on the loss function, adjusting parameters of the initial image recognition model to generate a final image recognition model;
wherein the local feature is a local feature map, the global feature is a global feature map, and the acquiring the global feature of the image sample includes:
performing regional blocking processing on the local feature map to obtain a plurality of image blocks;
determining a block feature map based on the plurality of image blocks, the block feature map comprising the plurality of image blocks;
determining a region similarity matrix based on the block feature map, wherein the region similarity matrix is used for indicating the similarity among the plurality of image blocks; wherein, based on the block feature map, determining a region similarity matrix includes: performing first shape conversion processing on the block feature map to obtain a first matrix; performing second shape conversion processing on the block feature map to obtain a second matrix; taking the product of the first matrix and the second matrix as the region similarity matrix;
the number of rows of the first matrix and the number of columns of the second matrix are the number of the plurality of image blocks;
acquiring the global feature map based on the region similarity matrix and the block feature map; the dimension of the global feature map is consistent with the dimension of the local feature map, the global feature map is obtained by performing shape conversion on a weighted feature map, and the weighted feature map is obtained by performing weighted calculation on the block feature map and the region similarity matrix.
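Claims 6 and 7 wrap the same forward pass in a standard supervised training loop: build a loss from the predicted and real recognition results, then adjust the model's parameters. A minimal PyTorch sketch, assuming cross-entropy as the loss and an externally supplied model and optimizer (the claim only requires "constructing a loss function"):

```python
import torch.nn.functional as F

def train_step(model, optimizer, images, labels) -> float:
    # Forward pass: `model` is assumed to implement the local-feature,
    # global-feature, and fusion pipeline of claim 6 internally and to
    # emit per-class logits as the predicted recognition result.
    logits = model(images)
    # Loss between the predicted recognition result and the real one;
    # cross-entropy is an assumed choice.
    loss = F.cross_entropy(logits, labels)
    # Adjust the parameters of the initial image recognition model.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```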
7. The method of claim 6, wherein the weighted feature map is obtained by performing weighted calculation on the block feature map and the region similarity matrix, and specifically includes:
normalizing the region similarity matrix to obtain a normalized region similarity matrix;
and taking the product of the normalized region similarity matrix and the block feature map as a weighted feature map.
8. The method according to any one of claims 6-7, wherein the local feature is a local feature map and the global feature is a global feature map, the obtaining a predictive recognition result of the image sample based on the local feature and the global feature comprises:
carrying out fusion processing on the local feature map and the global feature map to obtain a fusion feature map;
and acquiring a prediction recognition result of the image sample based on the fusion feature map.
9. The method of claim 8, wherein the image sample is a face image sample, the obtaining a predicted recognition result of the image sample based on the fused feature map comprises:
converting the fusion feature map into feature vectors to be identified;
determining the vector similarity between the feature vector to be identified and each candidate feature vector in the pre-stored candidate feature vectors of a plurality of users;
and determining user information of a user to which the face image sample belongs based on the vector similarity, and taking the user information as the prediction recognition result.
10. An image recognition apparatus comprising:
the first acquisition module is used for carrying out feature extraction processing on the image so as to acquire local features of the image, wherein the local features are used for expressing the features in the region of the image;
the second acquisition module is used for acquiring global features of the image, wherein the global features are used for expressing inter-region features of the image;
the identification module is used for acquiring an image identification result of the image based on the local feature and the global feature;
wherein the local feature is a local feature map, the global feature is a global feature map, and the second obtaining module is further configured to:
performing regional blocking processing on the local feature map to obtain a plurality of image blocks;
determining a block feature map based on the plurality of image blocks, the block feature map comprising the plurality of image blocks;
determining a region similarity matrix based on the block feature map, wherein the region similarity matrix is used for indicating the similarity among the plurality of image blocks;
acquiring the global feature map based on the region similarity matrix and the block feature map; the dimension of the global feature map is consistent with the dimension of the local feature map, the global feature map is obtained by performing shape conversion on a weighted feature map, and the weighted feature map is obtained by performing weighted calculation on the block feature map and the region similarity matrix;
wherein the second acquisition module is further configured to:
performing first shape conversion processing on the block feature map to obtain a first matrix;
performing second shape conversion processing on the block feature map to obtain a second matrix;
taking the product of the first matrix and the second matrix as the region similarity matrix;
the number of rows of the first matrix and the number of columns of the second matrix are the number of the plurality of image blocks.
11. The apparatus of claim 10, wherein the second acquisition module is further to:
normalizing the region similarity matrix to obtain a normalized region similarity matrix;
and taking the product of the normalized region similarity matrix and the block feature map as a weighted feature map.
12. The apparatus of claim 10, wherein the local feature is a local feature map and the global feature is a global feature map, the identification module further to:
carrying out fusion processing on the local feature map and the global feature map to obtain a fusion feature map;
and acquiring an image recognition result of the image based on the fusion feature map.
13. The apparatus of claim 12, wherein the image is a face image, the recognition module further to:
converting the fusion feature map into feature vectors to be identified;
determining the vector similarity between the feature vector to be identified and each candidate feature vector in the pre-stored candidate feature vectors of a plurality of users;
and determining the user to which the face image belongs based on the vector similarity.
14. The apparatus of any of claims 10-13, wherein the first acquisition module is further to:
adopting an image recognition model to perform feature extraction processing on the input image so as to output local features of the image.
15. A training device for an image recognition model, comprising:
the first acquisition module is used for carrying out feature extraction processing on an input image sample by adopting an initial image recognition model so as to acquire local features of the image sample, wherein the local features are used for expressing the features in the region of the image sample;
the second acquisition module is used for acquiring global features of the image sample, wherein the global features are used for expressing inter-region features of the image sample;
the prediction module is used for acquiring a prediction recognition result of the image sample based on the local feature and the global feature;
the construction module is used for constructing a loss function based on the prediction recognition result and the real recognition result corresponding to the image sample;
the generation module is used for adjusting parameters of the initial image recognition model based on the loss function so as to generate a final image recognition model;
wherein the local feature is a local feature map, the global feature is a global feature map, and the second obtaining module is further configured to:
performing regional blocking processing on the local feature map to obtain a plurality of image blocks;
determining a block feature map based on the plurality of image blocks, the block feature map comprising the plurality of image blocks;
determining a region similarity matrix based on the block feature map, wherein the region similarity matrix is used for indicating the similarity among the plurality of image blocks;
acquiring the global feature map based on the region similarity matrix and the block feature map; the dimension of the global feature map is consistent with the dimension of the local feature map, the global feature map is obtained by performing shape conversion on a weighted feature map, and the weighted feature map is obtained by performing weighted calculation on the block feature map and the region similarity matrix;
wherein the second acquisition module is further configured to:
performing first shape conversion processing on the block feature map to obtain a first matrix;
performing second shape conversion processing on the block feature map to obtain a second matrix;
taking the product of the first matrix and the second matrix as the region similarity matrix;
the number of rows of the first matrix and the number of columns of the second matrix are the number of the plurality of image blocks.
16. The apparatus of claim 15, wherein the second acquisition module is further to:
normalizing the region similarity matrix to obtain a normalized region similarity matrix;
and taking the product of the normalized region similarity matrix and the block feature map as a weighted feature map.
17. The apparatus of any of claims 15-16, wherein the local feature is a local feature map and the global feature is a global feature map, the prediction module further to:
carrying out fusion processing on the local feature map and the global feature map to obtain a fusion feature map;
and acquiring a prediction recognition result of the image sample based on the fusion feature map.
18. The apparatus of claim 17, wherein the image sample is a face image sample, the prediction module further to:
converting the fusion feature map into feature vectors to be identified;
determining the vector similarity between the feature vector to be identified and each candidate feature vector in the pre-stored candidate feature vectors of a plurality of users;
and determining user information of a user to which the face image sample belongs based on the vector similarity, and taking the user information as the prediction recognition result.
19. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.
20. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-9.
CN202210082039.7A 2022-01-24 2022-01-24 Training method, device, equipment and storage medium for image recognition and model thereof Active CN114550236B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210082039.7A CN114550236B (en) 2022-01-24 2022-01-24 Training method, device, equipment and storage medium for image recognition and model thereof

Publications (2)

Publication Number Publication Date
CN114550236A CN114550236A (en) 2022-05-27
CN114550236B CN114550236B (en) 2023-08-15

Family

ID=81671007

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210082039.7A Active CN114550236B (en) 2022-01-24 2022-01-24 Training method, device, equipment and storage medium for image recognition and model thereof

Country Status (1)

Country Link
CN (1) CN114550236B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8352465B1 (en) * 2009-09-03 2013-01-08 Google Inc. Grouping of image search results
CA2822150A1 (en) * 2013-07-26 2015-01-26 Rui Shen Method and system for fusing multiple images
CN109101865A (en) * 2018-05-31 2018-12-28 湖北工业大学 A kind of recognition methods again of the pedestrian based on deep learning
CN111079517A (en) * 2019-10-31 2020-04-28 福建天泉教育科技有限公司 Face management and recognition method and computer-readable storage medium
CN113239875A (en) * 2021-06-01 2021-08-10 恒睿(重庆)人工智能技术研究院有限公司 Method, system and device for acquiring human face features and computer readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Occluded facial expression recognition based on AR-WLD and block similarity weighting; Wang Xiaohua; Chen Ying; Hu Min; Ren Fuji; Laser & Optoelectronics Progress (Issue 04); full text *

Also Published As

Publication number Publication date
CN114550236A (en) 2022-05-27

Similar Documents

Publication Publication Date Title
JP7331171B2 (en) Methods and apparatus for training image recognition models, methods and apparatus for recognizing images, electronic devices, storage media, and computer programs
CN113255694B (en) Training image feature extraction model and method and device for extracting image features
CN113313022B (en) Training method of character recognition model and method for recognizing characters in image
CN113971751A (en) Training feature extraction model, and method and device for detecting similar images
CN113379627B (en) Training method of image enhancement model and method for enhancing image
CN114494784A (en) Deep learning model training method, image processing method and object recognition method
CN112861885A (en) Image recognition method and device, electronic equipment and storage medium
CN112580666A (en) Image feature extraction method, training method, device, electronic equipment and medium
CN112561879A (en) Ambiguity evaluation model training method, image ambiguity evaluation method and device
CN113627361B (en) Training method and device for face recognition model and computer program product
CN113177466A (en) Identity recognition method and device based on face image, electronic equipment and medium
CN115565177B (en) Character recognition model training, character recognition method, device, equipment and medium
CN114549904B (en) Visual processing and model training method, device, storage medium and program product
CN114724144B (en) Text recognition method, training device, training equipment and training medium for model
CN114550236B (en) Training method, device, equipment and storage medium for image recognition and model thereof
CN115565186A (en) Method and device for training character recognition model, electronic equipment and storage medium
CN112560848B (en) Training method and device for POI (Point of interest) pre-training model and electronic equipment
CN114707638A (en) Model training method, model training device, object recognition method, object recognition device, object recognition medium and product
CN115170919A (en) Image processing model training method, image processing device, image processing equipment and storage medium
CN113903071A (en) Face recognition method and device, electronic equipment and storage medium
CN114882334A (en) Method for generating pre-training model, model training method and device
CN113610856A (en) Method and device for training image segmentation model and image segmentation
CN114120410A (en) Method, apparatus, device, medium and product for generating label information
CN113221920B (en) Image recognition method, apparatus, device, storage medium, and computer program product
CN113177483B (en) Video object segmentation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant