CN116758618B - Image recognition method, training device, electronic equipment and storage medium

Image recognition method, training device, electronic equipment and storage medium

Info

Publication number: CN116758618B
Application number: CN202311030276.XA
Authority: CN (China)
Prior art keywords: training, image, model, feature matrix, feature
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN116758618A (en)
Inventors: 温东超, 梁玲燕, 史宏志, 赵雅倩, 葛沅, 崔星辰, 张英杰
Current Assignee: Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee: Suzhou Inspur Intelligent Technology Co Ltd
Application filed by Suzhou Inspur Intelligent Technology Co Ltd
Priority application: CN202311030276.XA
Publication of application: CN116758618A
Application granted; publication of grant: CN116758618B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/096 Transfer learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/52 Scale-space analysis, e.g. wavelet analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/809 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G06V40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an image recognition method, a training method and device, an electronic device and a storage medium, relating to the technical field of computer vision. The method comprises the following steps: acquiring an image to be identified; inputting the image to be identified into an image recognition model to obtain a feature matrix of the image to be identified output by the image recognition model; and performing image recognition on the image to be identified according to the feature matrix to obtain a recognition result of the image to be identified. The image recognition model is obtained by distillation training based on sample images of multiple categories, a first feature matrix and a first classification result of each sample image output by a teacher model in the training process, and a second feature matrix and a second classification result of each sample image output by a student model in the training process. The invention can compress the model scale and reduce the amount of computation as far as possible while improving face recognition accuracy as much as possible.

Description

Image recognition method, training device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of computer vision, and in particular to an image recognition method, a model training method and apparatus, an electronic device, and a storage medium.
Background
With the rapid development of deep learning in academia, image recognition products and applications based on deep learning have gradually permeated daily life and brought considerable convenience. How to construct a high-precision image recognition model has therefore become one of the technical problems the industry needs to solve.
Generally, to improve accuracy in practical applications, the performance of an image recognition model is raised by increasing the number of layers or the complexity of the branch structure of the neural network model when the model is built. In this scenario, to serve such a model, an application server needs to be equipped with a high-performance GPU (Graphics Processing Unit) board to complete the model inference task, which makes image recognition costly.
Disclosure of Invention
The invention provides an image recognition method, a model training method, an image recognition apparatus, an electronic device and a storage medium. They are intended to overcome the defect in the prior art that improving the performance of an image recognition model by increasing the number of layers or the branch-structure complexity of the neural network model leads to a large amount of computation and high cost, and to guarantee image recognition accuracy while reducing that cost.
The invention provides an image recognition method, which comprises the following steps:
acquiring an image to be identified;
inputting the image to be identified into an image recognition model to obtain a feature matrix of the image to be identified output by the image recognition model;
performing image recognition on the image to be identified according to the feature matrix to obtain a recognition result of the image to be identified;
the image recognition model is obtained by distillation training based on various types of sample images, a first feature matrix and a first classification result of each sample image output by a teacher model in a training process, and a second feature matrix and a second classification result of each sample image output by a student model in the training process.
According to the image recognition method provided by the invention, the image recognition model is trained based on the following steps:
constructing a training data set according to the sample images of various categories and the category labels of each sample image;
dividing the training data set for the current iterative training to obtain training data subsets corresponding to a plurality of batches of training;
for each batch of training, the following steps are performed (a minimal sketch follows this list):
respectively inputting the training data subset corresponding to the current batch of training into the student model after the previous batch of training and the teacher model after the previous batch of training, to obtain the first classification result, the second classification result, the first feature matrix and the second feature matrix of each sample image in the training data subset corresponding to the current batch of training;
acquiring a classification loss value according to the first classification result, the second classification result and the class label of each sample image in the training data subset corresponding to the current batch training;
acquiring a feature loss value according to the first feature matrix and the second feature matrix of each sample image in the training data subset corresponding to the current batch training;
performing iterative training on the student model after the previous batch training and the teacher model after the previous batch training according to the classification loss value and the characteristic loss value;
and constructing the image recognition model according to the student model trained in the last iteration.
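The per-batch step above can be illustrated with the following PyTorch-style sketch. All names here (`teacher`, `student`, `cls_loss_fn`, `feat_loss_fn`, the optimizers and the weighting factors) are illustrative assumptions, not an interface prescribed by the patent; both models are assumed to return a feature matrix and classification logits for a batch.

```python
import torch

def train_one_batch(teacher, student, images, labels,
                    cls_loss_fn, feat_loss_fn,
                    opt_teacher, opt_student, alpha=1.0, beta=1.0):
    """One batch of the distillation step described above (illustrative sketch)."""
    t_feat, t_logits = teacher(images)  # first feature matrix / first classification result
    s_feat, s_logits = student(images)  # second feature matrix / second classification result

    # classification loss from both classification results and the class labels
    cls_loss = cls_loss_fn(t_logits, labels) + cls_loss_fn(s_logits, labels)
    # feature loss from the two feature matrices
    feat_loss = feat_loss_fn(t_feat, s_feat)

    loss = alpha * cls_loss + beta * feat_loss  # weighted combination of loss values
    opt_teacher.zero_grad()
    opt_student.zero_grad()
    loss.backward()
    opt_teacher.step()
    opt_student.step()
    return loss.item()
```

The weighted combination mirrors the weighted addition of classification and feature loss values described below; note that in this scheme the teacher is updated per batch alongside the student.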
According to the image recognition method provided by the invention, the classification loss value comprises a first classification loss value and a second classification loss value;
the obtaining of a classification loss value according to the first classification result, the second classification result and the class label of each sample image in the training data subset corresponding to the current batch training includes:
acquiring the first classification loss value according to the first classification result and the class label of each sample image in the training data subset corresponding to the current batch training;
and acquiring the second classification loss value according to the second classification result and the classification label of each sample image in the training data subset corresponding to the current batch training.
According to the image recognition method provided by the invention, the obtaining the first classification loss value according to the first classification result and the class label of each sample image in the training data subset corresponding to the current batch training comprises the following steps:
based on a preset loss function, calculating a loss value of the first classification result and the class label of each sample image in the training data subset corresponding to the current batch training to obtain the first classification loss value;
wherein the preset loss function comprises an additive angular margin loss function (e.g. ArcFace), a cross-entropy loss function, or a large-margin cosine loss function (e.g. CosFace).
According to the image recognition method provided by the invention, the obtaining the second classification loss value according to the second classification result and the classification label of each sample image in the training data subset corresponding to the current batch training comprises the following steps:
calculating a loss value of the second classification result and the class label of each sample image in the training data subset corresponding to the current batch training based on a preset loss function to obtain the second classification loss value;
wherein the preset loss function comprises an additive angular margin loss function (e.g. ArcFace), a cross-entropy loss function, or a large-margin cosine loss function (e.g. CosFace).
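For orientation, the following sketch shows one way such a preset loss function could be realised: an additive angular margin (ArcFace-style) adjustment of the classification logits, which then feeds an ordinary cross entropy. The scale `s` and margin `m` are assumed hyperparameters, not values given in the patent.

```python
import torch
import torch.nn.functional as F

def margin_logits(features, weight, labels, s=64.0, m=0.5):
    """ArcFace-style additive angular margin logits (illustrative sketch).
    features: (N, D); weight: (D, K) classifier weights; labels: (N,)."""
    cos = F.normalize(features) @ F.normalize(weight, dim=0)     # (N, K) cosines
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    target = F.one_hot(labels, num_classes=weight.shape[1]).bool()
    cos_margin = torch.where(target, torch.cos(theta + m), cos)  # margin on true class only
    return s * cos_margin

# Either variant then feeds cross entropy:
#   loss = F.cross_entropy(margin_logits(feat, W, y), y)  # additive angular margin
#   loss = F.cross_entropy(feat @ W, y)                   # plain cross entropy
```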
According to the image recognition method provided by the invention, the feature loss value comprises a positive multi-element loss value and/or a negative multi-element loss value, and a distillation loss value;
the performing of iterative training on the student model after the previous batch training and the teacher model after the previous batch training according to the classification loss value and the feature loss value includes:
performing weighted addition on the positive multi-element loss value and/or the negative multi-element loss value, the distillation loss value and the classification loss value to obtain a target loss value;
and performing iterative training on the student model after the previous batch of training and the teacher model after the previous batch of training based on the target loss value.
According to the image recognition method provided by the invention, the distillation loss value is calculated based on the following steps:
calculating a similarity distance between a first feature matrix of each sample image and a second feature matrix of each sample image in a training data subset corresponding to the current batch training to obtain a first similarity distance;
and determining the distillation loss value according to the first similarity distance.
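A minimal sketch of this distillation loss, assuming cosine distance as the similarity distance (the text does not fix a particular distance measure):

```python
import torch.nn.functional as F

def distillation_loss(t_feat, s_feat):
    """Distillation loss between the first (teacher) and second (student)
    feature matrices of one batch: a cosine-based similarity distance per
    sample, averaged over the batch (illustrative sketch)."""
    first_distance = 1.0 - F.cosine_similarity(t_feat, s_feat, dim=1)
    return first_distance.mean()
```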
According to the image recognition method provided by the invention, the positive multi-element loss value is calculated based on the following steps:
and executing the following operations on each sample image in the training data subset corresponding to the current batch of training:
acquiring a target feature set corresponding to the category to which the current sample image belongs according to the training result of the current batch training or the training result of the current batch training and the training result of at least one historical batch training before the current batch training;
updating the target feature set according to a first feature matrix of the current sample image;
calculating the similarity distance between a second feature matrix of the current sample image and each feature matrix in the updated target feature set to obtain a second similarity distance;
and determining the positive multi-element loss value according to the second similarity distance.
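A sketch of the positive multi-element loss for one sample, assuming the target feature set has already been updated with the current sample's first feature matrix as described above, and again assuming cosine distance:

```python
import torch
import torch.nn.functional as F

def positive_loss(s_feat_i, target_feature_set):
    """Positive multi-element loss for the current sample (illustrative
    sketch): pull its second feature matrix towards every teacher-produced
    feature stored for its own class in the updated target feature set."""
    bank = torch.stack(list(target_feature_set))   # (M, D) stored features
    second_distance = 1.0 - F.cosine_similarity(
        s_feat_i.unsqueeze(0), bank, dim=1)        # one distance per stored feature
    return second_distance.mean()
```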
According to the image recognition method provided by the invention, the obtaining of the target feature set corresponding to the category to which the current sample image belongs according to the training result of the current batch training or the training result of the current batch training and the training result of at least one historical batch training before the current batch training comprises the following steps:
acquiring a first feature matrix of a first other sample image output by the teacher model after the previous batch of training according to the training result of the current batch of training;
determining the target feature set according to a first feature matrix of a first other sample image output by the teacher model after the previous training; or,
acquiring, according to the training result of the at least one historical batch training, a first feature matrix of the first other sample images and/or a first feature matrix of the current sample image output by the teacher model corresponding to the at least one historical batch training;
determining the target feature set according to the first feature matrix of the first other sample images output by the teacher model after the previous batch training, together with the first feature matrix of the first other sample images and/or the first feature matrix of the current sample image output by the teacher model corresponding to the at least one historical batch training;
the first other sample images are other sample images except the current sample image in a sample image set corresponding to the category to which the current sample image belongs.
According to the image recognition method provided by the invention, the updating of the target feature set according to the first feature matrix of the current sample image comprises the following steps:
under the condition that the number of the feature matrixes in the target feature set is smaller than a preset value, directly adding the first feature matrix of the current sample image into the target feature set;
under the condition that the number of the feature matrixes in the target feature set is equal to the preset value, determining a target feature matrix according to the update time of each feature matrix in the target feature set;
deleting the target feature matrix from the target feature set, and adding the first feature matrix of the current sample image to the deleted target feature set.
According to the image recognition method provided by the invention, the method for determining the target feature matrix according to the update time of each feature matrix in the target feature set comprises the following steps:
sorting the update times of the feature matrices in the target feature set in ascending order;
and taking the feature matrix whose update time ranks first, i.e. the earliest-updated feature matrix, as the target feature matrix.
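A sketch of this update rule, assuming the target feature set is kept as a deque ordered by update time, so the oldest entry is the leftmost one; the preset capacity is an assumed value, not one given in the text:

```python
from collections import deque

import torch

PRESET_SIZE = 8  # assumed capacity of the target feature set

def update_target_set(target_set: deque, t_feat_i: torch.Tensor):
    """Append while the set is below the preset size; once full, delete the
    feature matrix with the earliest update time first (illustrative sketch)."""
    if len(target_set) == PRESET_SIZE:
        target_set.popleft()              # earliest update time, per the rule above
    target_set.append(t_feat_i.detach())  # newest first feature matrix for this class
```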
According to the image recognition method provided by the invention, the negative multi-element loss value is calculated based on the following steps:
executing the following operations on each sample image in the training data subset corresponding to the current batch of training: calculating the similarity distance between the second feature matrix of the current sample image obtained by the current batch training and the first feature matrices of the second other sample images to obtain a third similarity distance;
determining the negative multi-element loss value according to the third similarity distance;
the second other sample images are other sample images except the current sample image in the training data subset corresponding to the current batch of training.
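A sketch of the negative multi-element loss for one sample; how the third similarity distance is converted into a loss value is not spelled out above, so the hinge form used here is one plausible reading, not the patent's prescription:

```python
import torch
import torch.nn.functional as F

def negative_loss(s_feat_i, t_feat_batch, i):
    """Negative multi-element loss for sample i (illustrative sketch): push the
    student's second feature matrix away from the teacher's first feature
    matrices of all other samples in the current batch subset."""
    mask = torch.ones(t_feat_batch.shape[0], dtype=torch.bool)
    mask[i] = False                                   # exclude the current sample
    third_distance = 1.0 - F.cosine_similarity(
        s_feat_i.unsqueeze(0), t_feat_batch[mask], dim=1)
    # hinge: small distances to other samples are penalised (assumed form)
    return torch.relu(1.0 - third_distance).mean()
```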
According to the image recognition method provided by the invention, the student model comprises a first feature extraction module and a first classification module;
The first feature extraction module is used for extracting features, and the first classification module is used for classifying images.
According to the image recognition method provided by the invention, the image recognition model is constructed according to the student model trained by the last iteration, and the method comprises the following steps:
and constructing the image recognition model according to the first feature extraction module in the student model trained in the last iteration.
According to the image recognition method provided by the invention, the teacher model comprises a second feature extraction module and a second classification module;
the second feature extraction module is used for extracting features, and the second classification module is used for classifying images;
the type of the basic network element of the second feature extraction module is the same as the type of the basic network element of the first feature extraction module; the number of basic network elements of the second feature extraction module is greater than the number of basic network elements of the first feature extraction module, and/or the number of nodes of the basic network elements of the second feature extraction module is greater than the number of nodes of the basic network elements of the first feature extraction module.
According to the image recognition method provided by the invention, the iterative training of the last batch of trained student models and the last batch of trained teacher models according to the classification loss value and the feature loss value comprises the following steps:
combining the classification loss value and the characteristic loss value, and performing iterative optimization on the teacher model trained in the previous batch to obtain a teacher model trained in the current batch;
combining the classification loss value and the characteristic loss value, and performing iterative optimization on the student model trained in the previous batch to obtain a student model trained in the current batch;
respectively inputting a training data subset of the next batch of training into the teacher model after the current batch of training and the student model after the current batch of training, and executing the iterative optimization step until all batches of training are completed, so as to obtain the teacher model after the current iterative training and the student model after the current iterative training;
continuing to divide the training data set, and performing iterative optimization on the teacher model after the current iterative training and the student model after the current iterative training based on a plurality of training data subsets corresponding to the batch training obtained by division until a preset termination condition is met; the preset termination condition includes reaching a maximum number of iterations.
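The outer loops of this scheme can be summarised as follows; this is a sketch in which `make_batches` and `step_fn` stand for the division step and the per-batch update described above:

```python
def fit(teacher, student, make_batches, step_fn, max_iterations):
    """Outer loops of the batch-wise scheme (illustrative sketch): each
    iteration re-divides the training set into batch subsets, every batch
    updates both the teacher and the student, and training stops once the
    preset termination condition (the maximum iteration count) is met."""
    for iteration in range(max_iterations):
        for images, labels in make_batches():  # fresh division every iteration
            step_fn(teacher, student, images, labels)
```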
According to the image recognition method provided by the invention, the training data set is divided to obtain training data subsets corresponding to a plurality of batches of training, and the method comprises the following steps:
preprocessing each sample image in the training data set;
randomly adjusting the arrangement sequence of each preprocessed sample image;
dividing the adjusted training data set into training data subsets corresponding to a plurality of batches of training;
the preprocessing includes one or more of image enhancement, image scaling, image channel transformation, image alignment, and normalization.
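A sketch of this division step, assuming a simple list-based pipeline; `preprocess` stands for whichever of the listed operations are applied:

```python
import random

def make_batches(samples, batch_size, preprocess):
    """Divide the training set as described above (illustrative sketch):
    preprocess every sample, randomly re-arrange the order, then cut the
    set into batch-sized subsets."""
    data = [(preprocess(image), label) for image, label in samples]
    random.shuffle(data)  # randomly adjusts the arrangement order
    return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]
```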
According to the image recognition method provided by the invention, the performing of image recognition on the image to be identified according to the feature matrix to obtain the recognition result of the image to be identified includes:
matching the feature matrix of the image to be identified with the feature matrix of each reference image in the image library to obtain a reference image matching the image to be identified;
and acquiring object attribute information corresponding to the image to be identified according to the object attribute information corresponding to the matched reference image.
The invention also provides a model training method for training the image recognition model in the image recognition method described above, the model training method comprising the following steps:
collecting sample images of multiple categories, a first feature matrix and a first classification result of each sample image output by a teacher model in a training process, and a second feature matrix and a second classification result of each sample image output by a student model in the training process;
and performing distillation training based on the sample images of various categories, a first feature matrix and a first classification result of each sample image output by the teacher model in the training process, and a second feature matrix and a second classification result of each sample image output by the student model in the training process to obtain the image recognition model.
The invention also provides an image recognition device, comprising:
the acquisition unit is used for acquiring the image to be identified;
the feature extraction unit is used for inputting the image to be identified into an image recognition model to obtain a feature matrix of the image to be identified output by the image recognition model;
the recognition unit is used for performing image recognition on the image to be identified according to the feature matrix to obtain a recognition result of the image to be identified;
the image recognition model is obtained by distillation training based on various types of sample images, a first feature matrix and a first classification result of each sample image output by a teacher model in a training process, and a second feature matrix and a second classification result of each sample image output by a student model in the training process.
The invention also provides an image recognition model training device for training the image recognition model in the image recognition method described above, the device comprising:
an acquisition unit, used for acquiring sample images of multiple categories, a first feature matrix and a first classification result of each sample image output by a teacher model in a training process, and a second feature matrix and a second classification result of each sample image output by a student model in the training process;
the training unit is used for carrying out distillation training to obtain the image recognition model based on the sample images of various categories, a first feature matrix and a first classification result of each sample image output by the teacher model in the training process, and a second feature matrix and a second classification result of each sample image output by the student model in the training process.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the image recognition method as described in any one of the above or the image recognition model training method as described above when executing the program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements an image recognition method as described in any of the above, or an image recognition model training method as described above.
The invention also provides a computer program product comprising a computer program which when executed by a processor implements an image recognition method as described in any one of the above, or an image recognition model training method as described above.
According to the image recognition method, the training method, the apparatus, the electronic device and the storage medium provided by the invention, the feature matrix and classification result of each sample image output by the teacher model during training and the feature matrix and classification result of each sample image output by the student model during training are collected for distillation training, yielding an image recognition model that fuses multiple categories and multiple scenes. Performing image recognition based on this model improves face recognition accuracy as much as possible while compressing the model scale and reducing the amount of computation as far as possible.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the drawings used in the embodiments or in the description of the prior art are briefly introduced below. It is apparent that the drawings in the following description show some embodiments of the invention; for a person skilled in the art, other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a schematic flow chart of an image recognition method provided by the invention;
fig. 2 is a schematic diagram of a system structure of an implementation scenario of the image recognition method provided by the present invention;
FIG. 3 is a schematic flow chart of the image recognition model training method provided by the invention;
FIG. 4 is a schematic diagram of an image recognition device according to the present invention;
FIG. 5 is a schematic diagram of the structure of the image recognition model training device provided by the invention;
fig. 6 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the invention are described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are some, but not all, embodiments of the invention. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of the invention.
With the rapid development of deep learning in academia, artificial intelligence products and applications based on deep learning, such as autonomous driving, smart speakers, automatic translation and ChatGPT (Chat Generative Pre-trained Transformer), have gradually permeated all aspects of daily life and brought considerable convenience. At the same time, these products and applications suffer from a large amount of computation, high resource occupancy and large carbon emissions. How to construct a high-precision image recognition model under these constraints is therefore one of the technical problems the industry needs to solve.
The main reason for this problem is that high-performance or high-accuracy deep neural network models typically contain tens or even hundreds of neural network layers and complex branch structures. In practical applications, in order to complete specified tasks in real time, such as object detection and scene classification, the application server needs to be equipped with a high-performance GPU board to complete the model inference task, which makes image recognition costly; when no high-performance GPU board is configured on the server, these applications are limited. There is therefore an urgent need in the industry to develop and deploy high-precision image recognition models with low computation.
In this regard, the invention provides an image recognition method that trains an image recognition model based on knowledge distillation, so as to obtain a high-precision, lightweight, low-computation image recognition model, thereby reducing the amount of computation, the resource occupancy and the carbon emissions of the model while effectively guaranteeing image recognition accuracy.
The image recognition method of the present invention is described below with reference to fig. 1 to 2. Fig. 1 is a schematic flow chart of an image recognition method according to an embodiment of the present application. The method can be applied to various image recognition scenarios. For example, for face recognition, image recognition may be performed on the server device and the recognition information displayed on the client device; alternatively, image recognition may be performed at the client device and the recognition information stored at the server device; this embodiment is not specifically limited. Illustratively, when the client device requests image recognition from the server, the server, in response to the request, provides an application programming interface (Application Programming Interface, API) to the client device so that the client device can obtain the image recognition information and display it.
Fig. 2 is a schematic diagram of a system architecture of an implementation scenario of an embodiment of the present application, including a server device and one or more client devices, where the server device communicates with the client devices through a network to provide relevant data for the client devices.
The server device 11 includes, but is not limited to, a web server, a file transfer protocol server, a dynamic host configuration protocol, and the like. The client device 12 may be configured to carry various operating systems, and the client device may be a mobile phone, a tablet computer, a desktop computer, a laptop computer, a handheld computer, a notebook computer, an ultra mobile personal computer, a netbook, a cellular phone, a personal digital assistant, an augmented reality device, a virtual reality device, an artificial intelligence device, a wearable device, a vehicle-mounted device, a smart home device, and/or a smart city device. The network 13 may be a local area network or a wide area network, such as the internet. The network 13 may be implemented using any known network communication protocol, which may be a variety of wired or wireless communication protocols such as ethernet, universal serial bus, global system for mobile communications, general packet radio service, code division multiple access, wideband code division multiple access, time division code division multiple access, long term evolution, bluetooth, wireless fidelity, voice over internet protocol, network slice architecture enabled communication protocols, or any other suitable communication protocol.
It will be appreciated that the structure shown in this embodiment does not constitute a specific limitation on the system of the actual scenario in which the image recognition method is applied. In other embodiments of the present application, the system of the actual scenario may include more or fewer devices than illustrated.
As shown in fig. 1, the method comprises the steps of:
Step 101, acquiring an image to be identified;
the image to be identified may be an image required to perform various computer vision processing tasks, such as face recognition, vehicle recognition, object detection, instance segmentation, and image classification, which is not specifically limited in this embodiment. In the following, a face recognition task is taken as an example, and the image recognition method provided in this embodiment is described in a development mode, and is applicable to other tasks as well, and is not described here again.
Optionally, the image to be identified may be captured in real time by a camera or by an intelligent terminal with a camera, obtained by scanning, or transmitted or downloaded over the Internet; the manner of acquiring the image to be identified is not specifically limited in this embodiment.
It will be appreciated that after the image to be identified is obtained, preprocessing may be performed on the image to be identified, including, but not limited to, scale normalization, image alignment, filtering, and the like, so as to improve the efficiency and accuracy of image identification.
Step 102, inputting the image to be identified into an image recognition model to obtain a feature matrix of the image to be identified output by the image recognition model; the image recognition model is obtained by distillation training based on sample images of multiple categories, a first feature matrix and a first classification result of each sample image output by a teacher model in the training process, and a second feature matrix and a second classification result of each sample image output by a student model in the training process;
the image recognition model is used for recognizing the image so as to recognize the characteristic matrix of the image; the image recognition model may be created based on a network model construction such as a convolutional neural network, a fully-connected network, or the like, which is not particularly limited in this embodiment.
The teacher model is a model with larger and more complex model specifications and better task execution effect compared with the student model, and is used for guiding the student model to train so as to construct and form an image recognition model. The teacher model can transfer knowledge in the teacher model to the student model based on the thought of the teacher-student network, so that the network performance of the student model is improved. Knowledge distillation is the process of transferring knowledge from a large model (i.e., teacher model) to a small model (i.e., student model). To achieve knowledge transfer, knowledge distillation techniques use the output of the student model to mimic the output of the teacher model. The expression capacity of the teacher model is stronger than that of the student model, and the output of the teacher model can be used for guiding the training process of the student model. When the output of the student model gradually approaches to the output of the teacher model, the performance of the student model gradually approaches to the performance of the teacher model, and knowledge of the teacher model is migrated to the image recognition model, so that the calculation cost of the image recognition model in the forward reasoning process is lower, and the image recognition model can be deployed on equipment with weaker calculation capability, so that the performance of the image recognition model is maximally close to the performance of the teacher model while the calculation cost is reduced.
Optionally, before step 102 is performed, a lightweight image recognition model that can accurately perform image recognition needs to be trained in advance, where the image recognition model is obtained by performing distillation learning training based on the following steps:
firstly, in order to improve the image recognition precision, a plurality of sample images of different categories are required to be acquired, and the sample images of different scenes are corresponding to each category, so that a training data set has enough depth and breadth, and an image recognition model obtained through distillation training is used for carrying out image recognition under different scenes.
For face recognition scenes, depth means that the more face images correspond to each object identifier the better, and that the face images of each object identifier should come from a plurality of different scenes, including scenes with different illumination changes, occlusion changes, appendage changes, facial pose changes, and age changes. Breadth means that the more categories of objects the training set covers the better, such as objects of different ethnicities, different genders and different skin colors.
In addition, a teacher model and a student model that can be used for image recognition are built. When constructing a training framework based on knowledge distillation, the feature expression capability of the teacher model should be higher than that of the student model. Generally, a network with a larger amount of computation has stronger feature expression capability and higher performance; to achieve this, the computation of the teacher model should be larger than that of the student model, that is, the model complexity of the teacher model is higher than that of the student model. Furthermore, to make it easier for the student model to imitate the output or intermediate representations of the teacher model, the teacher model and the student model may be configured with the same or similar basic network structure, for example both having convolutional layers, residual layers and fully-connected layers.
Then, training a teacher model and a student model based on sample images of various categories to dynamically collect a first feature matrix and a first classification result of each sample image output by the teacher model in the training process and a second feature matrix and a second classification result of each sample image output by the student model in the training process, and constructing one or more loss functions based on the first feature matrix and the first classification result and the second feature matrix and the second classification result so as to carry out distillation training on the student model, thereby obtaining the image recognition model with the teacher model and the lightweight structure.
Optionally, after the image recognition model is obtained through distillation training, when image feature extraction is required to be performed on the image to be recognized, the image to be recognized is input into the image recognition model obtained through training, so as to obtain a feature matrix of the image to be recognized.
Step 103, performing image recognition on the image to be identified according to the feature matrix to obtain a recognition result of the image to be identified.
Optionally, after the feature matrix of the image to be identified is obtained, matching the feature matrix of the image to be identified with the feature matrix in the image library to obtain an identification result of the image to be identified; the feature matrix may be input to the classification model to obtain the recognition result of the image to be recognized, which is not particularly limited in this embodiment. The recognition result includes, but is not limited to, an object ID, an object type, an object position, and the like in the image to be recognized, which is not particularly limited in this embodiment.
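A sketch of the matching variant, assuming cosine similarity against a feature library and an assumed acceptance threshold (neither the distance measure nor the threshold is fixed by the text):

```python
import torch
import torch.nn.functional as F

def identify(query_feat, library_feats, library_ids, threshold=0.5):
    """Match the query's feature matrix against every reference feature in the
    image library and return the object attribute (here, an object ID) of the
    best match (illustrative sketch); threshold is an assumed value."""
    sims = F.cosine_similarity(query_feat.unsqueeze(0), library_feats, dim=1)
    best = int(torch.argmax(sims))
    return library_ids[best] if sims[best] >= threshold else None
```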
According to the image recognition method above, the feature matrix and classification result of each sample image output by the teacher model during training and the feature matrix and classification result of each sample image output by the student model during training are collected for distillation training, yielding an image recognition model that fuses multiple categories and multiple scenes; performing image recognition based on this model improves face recognition accuracy as much as possible while compressing the model scale and reducing the amount of computation as far as possible.
In some embodiments the image recognition model is trained based on the steps of: constructing a training data set according to the sample images of various categories and the category labels of each sample image; dividing the training data set for the current iterative training to obtain training data subsets corresponding to a plurality of batches of training; for each of the lots of training the following steps are performed: respectively inputting a training data subset corresponding to the current batch of training into a student model after the previous batch of training and a teacher model after the previous batch of training to obtain the first classification result, the second classification result, the first feature matrix and the second feature matrix of each sample image in the training data subset corresponding to the current batch of training; acquiring a classification loss value according to the first classification result, the second classification result and the class label of each sample image in the training data subset corresponding to the current batch training; acquiring a feature loss value according to the first feature matrix and the second feature matrix of each sample image in the training data subset corresponding to the current batch training; performing iterative training on the student model after the previous batch training and the teacher model after the previous batch training according to the classification loss value and the characteristic loss value; and constructing the image recognition model according to the student model trained in the last iteration.
The so-called classification result may be a classification category determined based on the classification probability of the sample image output by the model. The class labels can be set according to actual classification scenes, for example, for face recognition scenes, the class labels can be distributed according to the number of different object IDs, so that the sample images of each class of the labels comprise face images of the same object ID, namely, the face images of the same object ID have the same class label. Assuming that the training dataset contains K object IDs, class labels may be assigned from 0 to K-1, or other class label assignment schemes such as from 1 to K may be employed.
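The 0-to-(K-1) assignment can be sketched as follows; the function name is illustrative:

```python
def assign_labels(object_ids):
    """Map K distinct object IDs onto the class labels 0 .. K-1, so that all
    face images of one object ID share the same class label (sketch)."""
    return {obj: label for label, obj in enumerate(sorted(set(object_ids)))}
```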
Optionally, in order to train a high-precision image recognition model, the training data set needs sample images covering multiple scenes under multiple categories, for example face images of at least 10,000 different object IDs, with an average of 20 face images per object ID; in addition, each sample image is annotated to obtain its class label.
Then, a training data set is constructed from the sample images and their class labels. The training data set may be constructed by directly taking each sample image as a sample and its class label as the sample label; or by applying data enhancement to the sample images and taking each enhanced image as a sample with the class label of the original image as its sample label, which is not specifically limited in this embodiment. A sample image carries the same class label before and after data enhancement.
Next, the following steps are performed for each iterative training:
dividing the training data set to obtain training data subsets corresponding to a plurality of batches of training; the training data set may be divided directly, or preprocessed and then divided. For example, all samples may be randomly re-arranged: shuffling the sample order prevents the ordering of the data set from influencing model performance and thus improves the generalization of the image recognition model. Within one iterative training cycle, no training sample is used repeatedly, unless repeated sample images were added to the training data set for sample expansion.
Optionally, the training data set is batched, i.e. divided into a plurality of batched sample sets (hereinafter also referred to as training data subsets), and accordingly, the current iterative training is divided into a plurality of batched training.
The number of samples contained in the batch sample sets herein may be set according to practical requirements, for example, each batch sample set is set to contain 128, 256 or 512 sample images.
It should be noted that, within one iterative training period, if the total number of samples in the training data set cannot be divided into an integer number of batch sample sets, part of the samples may go unused (e.g. simply discarded), or part of the samples may be used more than once to pad the data (e.g. a number of already-used samples are appended to the tail of the data set), so that the training data set divides into an integer number of batch sample sets; this embodiment is not specifically limited. The training procedure of the image recognition model is described below taking batch-wise division of the training data set as an example.
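The padding remedy can be sketched as follows (the other remedy is simply truncating the remainder):

```python
def pad_to_full_batches(data, batch_size):
    """Append already-used samples at the tail until the sample count divides
    into an integer number of batch sample sets (illustrative sketch)."""
    remainder = len(data) % batch_size
    if remainder:
        data = data + data[:batch_size - remainder]
    return data
```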
For the current batch of training, a teacher model after the previous batch of training and a student model after the previous batch of training are obtained; if the current batch training is the first batch training, the teacher model after the last iteration training and the student model after the last iteration training are respectively used as the teacher model after the last batch training and the student model after the last batch training.
Next, a current batch sample set (hereinafter also referred to as a training data subset corresponding to the current batch training) is input into the teacher model after the previous batch training and the student model after the previous batch training, so as to obtain a first feature matrix and a first classification result of each sample image in the current batch sample set output by the feature extraction module and the classification module of the teacher model after the previous batch training, and a second feature matrix and a second classification result output by the feature extraction module and the classification module of the student model after the previous batch training.
Then, obtaining at least one classification loss value according to the first classification result, the second classification result and the class label of each sample image in the current batch sample set; and acquiring at least one characteristic loss value according to the first characteristic matrix and the second characteristic matrix of each sample image in the current batch of sample sets. And performing iterative training on the student model after the previous batch training and the teacher model after the previous batch training based on the at least one classification loss value and the at least one feature loss value.
In some embodiments, the performing iterative training on the last batch of trained student models and the last batch of trained teacher models according to the classification loss value and the feature loss value includes: combining the classification loss value and the characteristic loss value, and performing iterative optimization on the teacher model trained in the previous batch to obtain a teacher model trained in the current batch; combining the classification loss value and the characteristic loss value, and performing iterative optimization on the student model trained in the previous batch to obtain a student model trained in the current batch; respectively inputting a training data subset of the next batch of training into the teacher model after the current batch of training and the student model after the current batch of training, and executing the iterative optimization step until all batches of training are completed, so as to obtain the teacher model after the current iterative training and the student model after the current iterative training; continuing to divide the training data set, and performing iterative optimization on the teacher model after the current iterative training and the student model after the current iterative training based on a plurality of training data subsets corresponding to the batch training obtained by division until a preset termination condition is met; the preset termination condition includes reaching a maximum number of iterations.
Optionally, a target loss value corresponding to the current batch training is determined by combining at least one classification loss value and at least one feature loss value, and the teacher model after the previous batch training and the student model after the previous batch training are iteratively optimized based on this target loss value, to obtain the teacher model of the current batch training and the student model of the current batch training.
Here, for the classification loss value, one or more loss calculations may be performed from the first classification result, the second classification result and the class labels to obtain at least one classification loss value; for the feature loss value, one or more loss calculations may be performed from the first feature matrix and the second feature matrix to obtain at least one feature loss value. The target loss value is then obtained by directly summing, or weighted-summing, the at least one classification loss value and the at least one feature loss value.
And then, according to the training step of the current batch training, continuing to perform iterative optimization on the student model of the current batch training and the teacher model of the current batch training based on the next batch sample set until all batch training is completed, and acquiring the teacher model after the current iterative training and the student model after the current iterative training.
And then, according to the step of the current iteration training, continuously dividing the training data set into a plurality of batch sample sets, and inputting the batch sample sets into the teacher model after the current iteration training and the student model after the current iteration training so as to perform batch iteration optimization on the teacher model after the current iteration training and the student model after the current iteration training until the maximum iteration times are reached. The maximum number of iterations may be set according to the actual requirement, such as 120 iteration cycles.
And finally, constructing an image recognition model according to the student model trained in the last iteration. Specifically, an image recognition model can be constructed according to a feature extraction module for extracting a feature matrix in the student model after the last iteration training.
It should be noted that the optimization algorithm adopted in the training process may be a stochastic gradient descent optimization algorithm, whose hyperparameters may be set according to actual requirements: for example, an initial learning rate of 0.1, reduced to 0.01 at the 30th period (epoch) and to 0.001 at the 60th period (epoch), with momentum set to 0.9 and weight decay applied.
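In PyTorch form, these settings could look as follows; the weight-decay value is not given in the text, so the one used here is purely an assumption, as is the stand-in model:

```python
import torch

student = torch.nn.Linear(512, 128)  # stand-in for the real student model
optimizer = torch.optim.SGD(student.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)  # decay value assumed
# learning rate 0.1 initially, 0.01 from epoch 30, 0.001 from epoch 60:
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[30, 60], gamma=0.1)

for epoch in range(120):  # e.g. 120 iteration cycles, as in the text
    # ... run all batch subsets of this epoch here ...
    scheduler.step()
```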
In the above method, sample images of multiple categories are used for distillation training, so that an image recognition model fusing multiple categories and multiple scenes is obtained; performing image recognition based on this model improves face recognition accuracy as much as possible while compressing the model scale and reducing the amount of computation.
In some embodiments, the categorical loss values comprise a first categorical loss value and a second categorical loss value; the step of obtaining the classification loss value includes: acquiring the first classification loss value according to the first classification result and the class label of each sample image in the training data subset corresponding to the current batch training; and acquiring the second classification loss value according to the second classification result and the classification label of each sample image in the training data subset corresponding to the current batch training.
Wherein, to avoid the influence of different models using different loss functions on model training, the first classification loss value and the second classification loss value may be function values of the same type of loss function output.
Optionally, for the calculation of the first classification loss value, the first classification result and the class label may be input into a pre-constructed loss function calculation model to calculate the first classification loss value; alternatively, the first classification result and the class label may be processed with a pre-configured loss value calculation rule to obtain the first classification loss value, which is not specifically limited in this embodiment.
In some embodiments, the step of obtaining the first classification loss value comprises: calculating, based on a preset loss function, a loss value over the first classification result and the class label of each sample image in the training data subset corresponding to the current batch training, to obtain the first classification loss value; wherein the preset loss function comprises an additive angular margin loss function, a cross-entropy loss function, or a margin cosine loss function.

Optionally, the loss value is calculated over the first classification result and the class label of each sample image in the training data subset corresponding to the current batch training based on the additive angular margin loss function, the cross-entropy loss function, or the margin cosine loss function, so as to obtain the first classification loss value.
For example, for the current batch training, when the cross entropy loss function is used to calculate the loss value for the first classification result and the class label, the calculation formula of the first classification loss value is as follows:
$$L_{cls}^{t} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp\!\left(w_{y_i}^{\top}\cdot f_{t}(x_i)\right)}{\sum_{k=1}^{K}\exp\!\left(w_{k}^{\top}\cdot f_{t}(x_i)\right)}$$

wherein $L_{cls}^{t}$ is the first classification loss value; N is the number of sample images in the current batch sample set; t denotes the teacher model; $x_i$ is the i-th sample image in the current batch sample set, $i \in \{1, 2, \dots, N\}$; $f_{t}(x_i)$ is the first feature matrix of the i-th sample image output by the feature extraction module of the teacher model after the previous batch training; $\cdot$ is the vector dot-product operation; $y_i$ is the class label corresponding to the i-th sample image, $y_i \in \{1, 2, \dots, K\}$; $W^{t}$ is the weight matrix of the classification module in the teacher model after the previous batch training; $w_{k}$ is the k-th column of $W^{t}$ and $w_{k}^{\top}$ is its transpose; and K is the number of class labels.
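As an illustrative sketch (variable names and shapes are assumptions, not from the source), the teacher-side classification loss can be computed from the feature matrices and a bias-free classification weight matrix as follows:

```python
import torch
import torch.nn.functional as F

N, D, K = 32, 512, 1000             # batch size, feature dimension, number of class labels
feat_t = torch.randn(N, D)          # first feature matrices f_t(x_i) from the teacher
W_t = torch.randn(D, K)             # weight matrix of the teacher's classification module
labels = torch.randint(0, K, (N,))  # class labels y_i

logits = feat_t @ W_t               # w_k^T . f_t(x_i) for every class k (no bias term b)
loss_cls_t = F.cross_entropy(logits, labels)  # mean over the N samples, as in the formula
```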
It should be noted that whether or not the classification module adds a bias term makes no obvious difference to model performance; it is therefore assumed here that the classification module does not include the bias term b, and the formulation can be generalized to the case that includes a bias term according to actual requirements.
For the calculation of the second classification loss value, the second classification result and the class label may be input into a pre-constructed loss function calculation model to calculate the second classification loss value; alternatively, the second classification result and the class label may be processed with a pre-configured loss value calculation rule, which is not specifically limited in this embodiment.
In some embodiments, the step of obtaining the second classification loss value comprises: calculating, based on a preset loss function, a loss value over the second classification result and the class label of each sample image in the training data subset corresponding to the current batch training, to obtain the second classification loss value; wherein the preset loss function comprises an additive angular margin loss function, a cross-entropy loss function, or a margin cosine loss function.

Optionally, the loss value is calculated over the second classification result and the class label of each sample image in the training data subset corresponding to the current batch training based on the additive angular margin loss function, the cross-entropy loss function, or the margin cosine loss function, so as to obtain the second classification loss value.
For example, for the current batch training, when the cross entropy loss function is used to calculate the loss value for the second classification result and the class label, the calculation formula of the second classification loss value is as follows:
$$L_{cls}^{s} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp\!\left(w_{y_i}^{\top}\cdot f_{s}(x_i)\right)}{\sum_{k=1}^{K}\exp\!\left(w_{k}^{\top}\cdot f_{s}(x_i)\right)}$$

wherein $L_{cls}^{s}$ is the second classification loss value; N is the number of sample images in the current batch sample set; s denotes the student model; $x_i$ is the i-th sample image in the current batch sample set, $i \in \{1, 2, \dots, N\}$; $f_{s}(x_i)$ is the second feature matrix of the i-th sample image output by the feature extraction module of the student model after the previous batch training; $\cdot$ is the vector dot-product operation; $y_i$ is the class label corresponding to the i-th sample image, $y_i \in \{1, 2, \dots, K\}$; $W^{s}$ is the weight matrix of the classification module in the student model after the previous batch training; $w_{k}$ is the k-th column of $W^{s}$ and $w_{k}^{\top}$ is its transpose; and K is the number of class labels.
In some embodiments, the feature loss values include a positive multivariate loss value and/or a negative multivariate loss value, together with a distillation loss value; the step of training the image recognition model further comprises: performing weighted addition on the positive multivariate loss value and/or the negative multivariate loss value, the distillation loss value, and the classification loss values to obtain the target loss value; and performing iterative training on the student model after the previous batch training and the teacher model after the previous batch training based on the target loss value.
The weight coefficient in the weighted addition may be adaptively set according to actual requirements, or may be calculated by a weight analysis algorithm, such as a principal component analysis algorithm, an entropy method, or the like, which is not specifically limited in this embodiment.
The distillation loss value is used to compress the distance between the second feature matrix of each sample image output by the student network and the first feature matrix of the same sample image output by the teacher model; the positive multivariate loss value is used to compress the distance between the second feature matrix of a sample image output by the student network and each first feature matrix, output by the teacher model, in the dynamic feature set of the category to which the sample image belongs; the negative multivariate loss value is used to expand the distance between the second feature matrix of a sample image output by the student network and the first feature matrices of the other sample images output by the teacher model.
In some embodiments, the distillation loss value is calculated based on the steps of: calculating a similarity distance between a first feature matrix of each sample image and a second feature matrix of each sample image in a training data subset corresponding to the current batch of training to obtain a first similarity distance; and determining the distillation loss value according to the first similarity distance.
Optionally, performing similarity distance calculation on a first feature matrix and a second feature matrix of each sample image in the current batch of sample sets to obtain a first similarity distance corresponding to each sample image;
and the first similarity distances corresponding to all the sample images in the current batch sample set are averaged to obtain the distillation loss value. The similarity distance may be calculated as a Euclidean distance, a Mahalanobis distance, or the like, which is not particularly limited in this embodiment.
Taking Euclidean distance as an example, the calculation formula of the distillation loss value is as follows:
$$d_i = \sqrt{\sum_{j=1}^{D}\left(f_{t}(x_i)_{j} - f_{s}(x_i)_{j}\right)^{2}}, \qquad L_{dist} = \frac{1}{N}\sum_{i=1}^{N} d_i$$

wherein $d_i$ is the Euclidean distance between the first feature matrix and the second feature matrix of the i-th sample image; N is the number of sample images in the current batch sample set; $x_i$ is the i-th sample image in the current batch sample set, $i \in \{1, 2, \dots, N\}$; $f_{t}(x_i)$ and $f_{s}(x_i)$ are, respectively, the first feature matrix of the i-th sample image output by the teacher model after the previous batch training and the second feature matrix of the i-th sample image output by the student model after the previous batch training; D is the number of elements in the feature matrix (namely, the first feature matrix or the second feature matrix), i.e., the output dimension of the feature extraction module of the teacher model or the student model; and $L_{dist}$ is the distillation loss value.
It should be noted that, the output dimensions of the feature extraction modules of the teacher model and the student model are consistent, for example, 512-dimensional vectors.
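A minimal sketch of this computation follows (variable names are assumptions): the distillation loss as the batch-averaged Euclidean distance between teacher and student feature matrices.

```python
import torch

N, D = 32, 512               # batch size; output dimension of both feature extractors
feat_t = torch.randn(N, D)   # first feature matrices from the teacher model
feat_s = torch.randn(N, D)   # second feature matrices from the student model

d = torch.norm(feat_t - feat_s, p=2, dim=1)  # Euclidean distance per sample image
loss_distill = d.mean()                      # average over the current batch sample set
```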
In some embodiments, the positive multivariate loss value is calculated based on the following steps, executed for each sample image in the training data subset corresponding to the current batch training: acquiring the target feature set corresponding to the category to which the current sample image belongs according to the training result of the current batch training, or according to the training result of the current batch training and the training result of at least one historical batch training before the current batch training; updating the target feature set according to the first feature matrix of the current sample image; calculating the similarity distance between the second feature matrix of the current sample image and each feature matrix in the updated target feature set to obtain second similarity distances; and determining the positive multivariate loss value according to the second similarity distances.
Optionally, during model training, each category of sample images has a corresponding dynamic feature set (hereinafter also referred to as the target feature set), whose elements are first feature matrices, output by the teacher model, of one or more sample images under that category. The dynamic feature set is a queue for storing feature matrices of sample images; it holds at most a specified number of feature matrices, and the feature matrices in the set are updated in real time during model training.
The process of creating and updating the target feature set of the category to which each sample image belongs is as follows:
acquiring a target feature set corresponding to the category to which the current sample image belongs according to a first feature matrix of each sample image in the training results of the current batch training; or, according to the first feature matrix of each sample image in the training result of the current batch training and the first feature matrix of each sample image in the training result of at least one historical batch training before the current batch training, a target feature set corresponding to the category to which the current sample image belongs is obtained, which is not limited in this embodiment specifically.
And then, dynamically updating the target feature set by the first feature matrix of the current sample image to obtain an updated target feature set.
And then, calculating the similarity distance between the second feature matrix of the current sample image and each feature matrix in the updated target feature set to obtain the second similarity distance corresponding to each feature matrix in the updated target feature set.
Then, all the second similarity distances are averaged to obtain the positive multivariate loss value. The similarity distance may be calculated as a Euclidean distance, a Mahalanobis distance, or the like, which is not particularly limited in this embodiment.
Taking the Euclidean distance as an example, a positive pair refers to the combination of the second feature matrix obtained for a sample image through the student network with one feature matrix in the target feature set of the category to which the sample image belongs; assuming the updated target feature set contains M elements, there are M positive pairs in total. The calculation formula of the positive multivariate loss value is as follows:

$$L_{pos}^{(i)} = \frac{1}{M}\sum_{m=1}^{M}\sqrt{\sum_{d=1}^{D}\left(f_{s}(x_i)_{d} - (q_{m}^{c_i})_{d}\right)^{2}}, \qquad L_{pos} = \frac{1}{N}\sum_{i=1}^{N} L_{pos}^{(i)}$$

wherein $L_{pos}^{(i)}$ is the positive multivariate loss value corresponding to sample image $x_i$; D is the number of elements in the feature matrix (i.e., the first or second feature matrix); $f_{s}(x_i)$ is the second feature matrix of sample image $x_i$ output by the student model; $q_{m}^{c_i}$ is the m-th feature matrix in the updated target feature set corresponding to the category $c_i$ to which sample image $x_i$ belongs, with $m \in \{1, 2, \dots, M\}$; M is the number of elements in the updated target feature set; $L_{pos}$ is the positive multivariate loss value corresponding to the current batch training; and N is the total number of sample images in the current batch sample set.
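An illustrative sketch of this loss follows (names, shapes, and the averaging over the M positive pairs are assumptions consistent with the formula above):

```python
import torch

N, D, M = 32, 512, 8
feat_s = torch.randn(N, D)          # second feature matrices from the student model
target_sets = torch.randn(N, M, D)  # updated target feature set of each sample's category

# Euclidean distance from each student feature to each of the M set members.
dists = torch.norm(feat_s.unsqueeze(1) - target_sets, p=2, dim=2)  # shape (N, M)
loss_pos = dists.mean()  # average over the M positive pairs and the N sample images
```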
According to the method provided by this embodiment, the second feature matrix of each sample image is paired with the updated target feature set corresponding to the category to which the sample image belongs, so that first feature matrices produced by the teacher model at multiple points in the training process are used to constrain the second feature matrix output by the student model. On one hand, compared with the features of a single teacher snapshot, the features of multiple teacher snapshots have diversity of expression, which can prevent the negative influence caused by feature mutation, and constraining the features of the student network with a plurality of feature matrices at the same time yields a stable result. On the other hand, the features of multiple teacher snapshots simultaneously constraining the features of the student network is equivalent to pulling the features of the student network toward the feature center of the category, so that the similarity between features of images of the same category is as high as possible, thereby improving image recognition accuracy while accelerating the training of the image recognition model.
In some embodiments, the negative multivariate loss value is calculated based on the following steps, executed for each sample image in the training data subset corresponding to the current batch training: calculating the similarity distance between the second feature matrix of the current sample image obtained in the current batch training and the first feature matrices of the second other sample images, to obtain third similarity distances; and determining the negative multivariate loss value according to the third similarity distances; wherein the second other sample images are the sample images, other than the current sample image, in the training data subset corresponding to the current batch training.
Optionally, the second feature matrix $f_{s}(x_i)$ of sample image $x_i$ output by the student network forms a negative pair with the first feature matrix $f_{t}(x_j)$ of each other sample image $x_j$ output by the teacher network, giving N-1 negative pairs in total, where $i, j \in \{1, 2, \dots, N\}$, $j \neq i$, and N is the number of sample images in the current training batch sample set.

Within the current training batch sample set, the training algorithm needs to ensure that each sample image belongs to a category different from those of the other samples. It is therefore desirable to determine the negative multivariate loss value so as to push the second feature matrix $f_{s}(x_i)$ of sample image $x_i$ away from the first feature matrices $f_{t}(x_j)$ of the other sample images.

The similarity distance may be calculated as a Euclidean distance, a Mahalanobis distance, or the like, and is not particularly limited in this embodiment.
Taking the Euclidean distance as an example, the calculation formula of the negative multivariate loss corresponding to each batch training is as follows:

$$L_{neg}^{(i)} = -\frac{1}{N-1}\sum_{\substack{j=1 \\ j \neq i}}^{N}\sqrt{\sum_{d=1}^{D}\left(f_{s}(x_i)_{d} - f_{t}(x_j)_{d}\right)^{2}}, \qquad L_{neg} = \frac{1}{N}\sum_{i=1}^{N} L_{neg}^{(i)}$$

wherein $L_{neg}^{(i)}$ is the negative multivariate loss value corresponding to sample image $x_i$; D is the number of elements in the feature matrix (i.e., the first or second feature matrix); $f_{s}(x_i)$ is the second feature matrix of sample image $x_i$ output by the student model; $f_{t}(x_j)$ is the first feature matrix of sample image $x_j$ output by the teacher model, with $i, j \in \{1, 2, \dots, N\}$ and $j \neq i$; N is the total number of sample images in the current batch sample set; and $L_{neg}$ is the negative multivariate loss value corresponding to the current batch training. The negative sign ensures that minimizing the target loss expands these distances.
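A minimal sketch follows (the sign convention is an assumption consistent with the weighted-addition target loss below; names are illustrative):

```python
import torch

N, D = 32, 512
feat_s = torch.randn(N, D)   # second feature matrices from the student model
feat_t = torch.randn(N, D)   # first feature matrices from the teacher model

dists = torch.cdist(feat_s, feat_t, p=2)    # pairwise Euclidean distances, shape (N, N)
off_diag = ~torch.eye(N, dtype=torch.bool)  # keep only j != i (the negative pairs)
loss_neg = -dists[off_diag].mean()          # expand distances by minimizing their negative
```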
Optionally, in the case where the feature loss values include the positive multivariate loss value, the negative multivariate loss value, and the distillation loss value, the target loss value may be obtained by weighted addition of the positive multivariate loss value, the negative multivariate loss value, the distillation loss value, the first classification loss value and the second classification loss value; the specific calculation formula is as follows:

$$L = L_{cls}^{t} + L_{cls}^{s} + \lambda_{1} L_{dist} + \lambda_{2}\left(L_{pos} + L_{neg}\right)$$

wherein $L$ is the target loss value; $L_{cls}^{t}$ is the first classification loss value; $L_{cls}^{s}$ is the second classification loss value; $L_{dist}$ is the distillation loss value; $L_{pos}$ is the positive multivariate loss value; $L_{neg}$ is the negative multivariate loss value; $\lambda_{1}$ is the weight coefficient corresponding to the distillation loss value; and $\lambda_{2}$ is the weight coefficient corresponding to the sum of the positive and negative multivariate loss values. Here $\lambda_{1}$ and $\lambda_{2}$ are in a certain proportion and are set according to actual requirements; this embodiment is not particularly limited thereto.
For the case where the feature loss values include the positive multivariate loss value and the distillation loss value but not the negative multivariate loss value, the target loss value may be obtained by weighted addition of the positive multivariate loss value, the distillation loss value, the first classification loss value and the second classification loss value; the specific calculation formula is as follows:

$$L = L_{cls}^{t} + L_{cls}^{s} + \lambda_{1} L_{dist} + \lambda_{2} L_{pos}$$

wherein $L$ is the target loss value; $L_{cls}^{t}$ is the first classification loss value; $L_{cls}^{s}$ is the second classification loss value; $L_{dist}$ is the distillation loss value; $L_{pos}$ is the positive multivariate loss value; $\lambda_{1}$ is the weight coefficient corresponding to the distillation loss value; and $\lambda_{2}$ is the weight coefficient corresponding to the positive multivariate loss value. Here $\lambda_{1}$ and $\lambda_{2}$ are in a certain proportion and are set according to actual requirements; this embodiment is not particularly limited thereto.
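A short sketch of the weighted combination follows (the individual loss values and the weights are placeholder assumptions for illustration):

```python
import torch

loss_cls_t = torch.tensor(2.3)    # first classification loss value (teacher branch)
loss_cls_s = torch.tensor(2.7)    # second classification loss value (student branch)
loss_distill = torch.tensor(0.8)  # distillation loss value
loss_pos = torch.tensor(1.1)      # positive multivariate loss value
loss_neg = torch.tensor(-0.9)     # negative multivariate loss value (if used)

lam1, lam2 = 1.0, 0.1             # assumed example weights; set per actual requirements
target_loss = loss_cls_t + loss_cls_s + lam1 * loss_distill + lam2 * (loss_pos + loss_neg)
```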
In some embodiments, the step of obtaining the target feature set corresponding to the category to which the current sample image belongs in the process of calculating the positive multivariate loss value includes: acquiring a first feature matrix of a first other sample image output by the teacher model after the previous batch of training according to the training result of the current batch of training; determining the target feature set according to a first feature matrix of a first other sample image output by the teacher model after the previous training; or, according to the training result of the at least one historical batch training, acquiring a first feature matrix of the first other sample images and/or a first feature matrix of the current sample image output by the teacher model corresponding to the at least one historical batch training; determining the target feature set according to a first feature matrix of a first other sample image output by the teacher model after the previous batch training and at least one first feature matrix of the first other sample image and/or a first feature matrix of the current sample image output by the teacher model corresponding to the historical batch training; the first other sample images are other sample images except the current sample image in a sample image set corresponding to the category to which the current sample image belongs.
Optionally, when, according to the training result of the current batch training, the number of first feature matrices of the first other sample images output by the teacher model after the previous batch training is determined to have reached the upper limit of the target feature set, a plurality of first feature matrices with the most recent update times may be selected from them, in order of update time, to form the target feature set; when the number of acquired first feature matrices of the first other sample images output by the teacher model after the previous batch training is smaller than the upper limit of the target feature set, the target feature set may be formed, in order of update time, from the first feature matrices of the first other sample images output by the teacher model after the previous batch training together with the first feature matrices of the first other sample images and of the current sample image in the training results of at least one historical batch training; this embodiment is not specifically limited thereto.
In the related art, when performing distillation training of an image recognition model, information (also called knowledge) from the training process of the teacher network is integrated and then transferred to the student network: during training of the teacher network, a suitable number of intermediate models is saved, a self-attention module adaptively assigns weights to the different intermediate models, and the knowledge of the intermediate models is then integrated through an ensembling technique and transferred.

This distillation training approach thus works by saving multiple intermediate-stage models of the teacher network and migrating/distilling the information of these models into the student network. Because these models come from different training stages, they have different feature spaces: the feature space of a teacher network saved in the initial training stage differs greatly from the feature space of the student network in the current training stage, which can have a strong negative effect on model learning. Moreover, saving several teacher network models during training and performing forward computation with each of them significantly increases the amount of computation and storage, thereby reducing model training efficiency. For example, assuming that the dynamic feature set of each category contains M features, obtaining M features in such a distillation learning manner increases the computation by roughly a factor of M.
In contrast, in the manner provided by this embodiment, the target feature set corresponding to the category to which the current sample image belongs is dynamically acquired during training based on the training result of the current batch training, or on the training result of the current batch training together with the training results of at least one historical batch training before it, so that the dynamically acquired target feature set has the following properties: 1) it contains output features of the teacher network for sample images of this category during the current batch training; 2) it contains output features of the teacher network for the other sample images, or the current sample image, of this category from the previous or historical training batches; 3) every feature in the set was output by the teacher network during the current, previous, or a historical batch training; 4) because these iterations are adjacent or identical, the teacher network parameters change little or not at all between them, and the feature spaces expressed by the teacher network are therefore similar or identical. Since the features in the dynamically acquired target feature set lie in similar or identical feature spaces, comparison calculations between them and the features generated by the student network in the current iteration are more accurate, and the knowledge in the dynamic feature set of each sample image's category (i.e., the knowledge of the teacher network) is migrated to the student network in every training batch. This improves the regularization effect of the teacher model and has a positive influence on the training process, so that a lightweight image recognition model with higher recognition performance is acquired efficiently and accurately; compared with the related art, the method provided by this embodiment therefore offers higher computational efficiency, training stability, and image recognition accuracy for this compute-intensive task.
In some embodiments, the step of updating the target feature set corresponding to the category to which the current sample image belongs in the process of computing the positive multivariate loss value includes: under the condition that the number of the feature matrixes in the target feature set is smaller than a preset value, directly adding the first feature matrix of the current sample image into the target feature set; under the condition that the number of the feature matrixes in the target feature set is equal to the preset value, determining a target feature matrix according to the update time of each feature matrix in the target feature set; deleting the target feature matrix from the target feature set, and adding the first feature matrix of the current sample image to the deleted target feature set.
The preset value is the upper limit of the number of feature matrices stored in the target feature set and may be set according to the performance requirements of the model, for example to 32 as in the example below; this embodiment is not limited thereto.
Optionally, in the process of calculating the positive multivariate loss value, the following steps may be adopted to update the target feature set corresponding to the category to which the current sample image belongs:
Comparing the number of the feature matrixes in the target feature set with a preset value to determine whether the number of the feature matrixes in the target feature set reaches an upper limit value, and adding a first feature matrix of the current sample image into the target feature set under the condition that the number of the feature matrixes in the target feature set does not reach the upper limit value; and under the condition that the upper limit value is reached, determining a target feature matrix according to the update time of each feature matrix in the target feature set, so as to delete the target feature matrix from the target feature set, and then adding the first feature matrix of the current sample image into the target feature set.
Here, the determination manner of the target feature matrix may be to determine, as the target feature matrix, a feature matrix whose update time is earlier than a preset time point, or determine, as the target feature matrix, a feature matrix whose update time is earliest, or the like, which is not specifically limited in this embodiment.
In some embodiments, the step of determining the target feature matrix comprises: ascending sort is carried out on the update time of each feature matrix in the target feature set; and taking the feature matrix corresponding to the updating time with the forefront ordering as the target feature matrix.
Optionally, the update time of each feature matrix in the target feature set is sorted in ascending order, that is, the update time is sorted according to the order from early to late, so that the feature matrix corresponding to the update time with the forefront sorting is selected as the target feature matrix according to the sorting result, that is, the feature matrix with the earliest update time is selected as the target feature matrix.
The following specifically describes the process of constructing and updating the target feature set under a certain category by using a specific example:
An empty feature set Q = {} is initialized, and an upper limit T on the total number of elements of the set Q is set, e.g., T = 32.
When the 1st sample image $x_1$ of the category enters the training process, $x_1$ is input into the teacher network t to obtain the teacher network's output feature matrix $f_t(x_1)$, and $f_t(x_1)$ is added to the feature set: $Q = \{f_t(x_1)\}$.

When the j-th sample image $x_j$ of the category enters the training process, $x_j$ is input into the teacher network t to obtain the teacher network's output feature matrix $f_t(x_j)$.
Then, the total number n of elements of the current feature set is determined, and the following operations are performed:
if the number of elements n of the feature set is less than the upper limit of the total number of elements T,is added to the feature set
If the number of elements n of the feature set is equal to the upper limit of the total number of elements T, Is removed (I)>Is added to the feature set->
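A minimal sketch of this queue, assuming a simple per-category deque whose oldest entry is evicted at capacity, follows (names are illustrative):

```python
from collections import deque

import torch

T = 32
feature_set: deque = deque(maxlen=T)  # the deque itself drops the oldest item when full

def update_feature_set(q: deque, feat_t: torch.Tensor) -> None:
    """Add a teacher-output feature matrix; the earliest-updated entry is evicted at capacity."""
    q.append(feat_t.detach())  # detached: set members are treated as constants

# Usage: called as each sample image of this category passes through the teacher network.
for _ in range(40):  # more images than the upper limit T
    update_feature_set(feature_set, torch.randn(512))
assert len(feature_set) == T  # the set never exceeds its upper limit
```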
According to the method provided by this embodiment, the feature set corresponding to each category is dynamically updated, so that the features in the set come from features of samples of the same category dynamically saved in different iteration periods rather than being limited to the features of a single sample image. This reduces the computational cost required for forward inference; meanwhile, introducing it into the knowledge distillation training process can effectively improve the stability of the model training process and improve image recognition accuracy.
In some embodiments, the student model includes a first feature extraction module and a first classification module; the first feature extraction module comprises a first backbone network and a first fully-connected network, and the first classification module comprises a second fully-connected network; the first backbone network comprises a plurality of convolution layers and residual layers; the output end of the first backbone network is connected with the input end of the first fully-connected network, and the output end of the first fully-connected network is connected with the input end of the second fully-connected network; the first feature extraction module is used for extracting features, and the first classification module is used for classifying images.
The first feature extraction module is used for extracting features, namely a second feature matrix used for outputting a sample image in the model training process; the first classification module is used for classifying images, namely, outputting a second classification result of the sample images in the model training process.
The student model may be constructed from the first feature extraction module and the first classification module, and its specific structure may be set according to actual requirements without being limited to a particular student model structure, for example a network model formed by modifying ResNet18 (a deep residual network with a network depth of 18 layers). ResNet18 uses residual structures as its basic building blocks.
Optionally, the first backbone network is formed by a stack of residual blocks of multiple layers, each residual block being formed by one or more convolutional layers in jumping connection with the residual network.
The number of the convolution layers contained in the first backbone network may be set according to actual needs, for example, the first backbone network includes 5-layer convolution modules, i.e., conv1, conv2_x, conv3_x, conv4_x, and conv5_x; wherein conv1 is a layer of convolution layer with a convolution kernel of 3x3x3 and a convolution step length of 1; conv2_x is a multi-layer convolution layer (e.g., 4 layers) with a convolution kernel of 3x3x 64; conv3_x is a multi-layer convolution layer whose convolution kernel is 3x3x 128; conv4_x is a multi-layer convolution layer whose convolution kernel is 3x3x256, and conv5_x is a multi-layer convolution layer whose convolution kernel is 3x3x 512.
The input of the first full-connection layer is the output characteristic of conv5_x, and the output dimension of the first full-connection layer can be set according to actual requirements, such as 512-dimensional characteristics. The input of the second full-connection layer is the output characteristic (such as 512-dimensional characteristic) of the first full-connection layer, and the dimension of the output characteristic of the second full-connection layer is the same as the number of categories in the training data set, that is, one dimension corresponds to one category.
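An illustrative sketch of such a student model follows; it is an assumed rendering built on the stock torchvision ResNet18, which differs in detail (e.g., its first convolution) from the modified backbone described above:

```python
import torch
from torch import nn
from torchvision.models import resnet18

class StudentModel(nn.Module):
    """ResNet18-style backbone + 512-d feature layer + classification layer (a sketch)."""

    def __init__(self, num_classes: int, feat_dim: int = 512):
        super().__init__()
        backbone = resnet18(weights=None)  # conv1 .. conv5_x residual stages
        backbone.fc = nn.Identity()        # expose the pooled conv5_x features (512-d)
        self.backbone = backbone
        self.fc1 = nn.Linear(512, feat_dim)          # first fully-connected network
        self.fc2 = nn.Linear(feat_dim, num_classes)  # second fully-connected network

    def forward(self, x: torch.Tensor):
        feat = self.fc1(self.backbone(x))  # second feature matrix of the sample image
        logits = self.fc2(feat)            # second classification result
        return feat, logits

student = StudentModel(num_classes=1000)
feat, logits = student(torch.randn(2, 3, 112, 112))  # 112x112 three-channel inputs
```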
It should be noted that the initial student model is obtained based on one or more input network model description files (containing the weight parameters of the network model and the description of the network model): the input network model description file is read, the input neural network model is constructed according to the description file, and the weight parameters of the constructed neural network model are initialized; the weight parameters of the neural network model are parameters obtained through a model pre-training method or randomly initialized parameters.
In the mode provided by the embodiment, the student model is formed by constructing the multi-layer convolution layer, the residual layer and the full-connection network, so that the image recognition model capable of accurately recognizing the image is efficiently and accurately obtained through the student model.
In some embodiments, the constructing the image recognition model according to the student model trained in the last iteration includes: and constructing the image recognition model according to the first feature extraction module in the student model trained in the last iteration.
Optionally, the first feature extraction module formed by combining the first backbone network and the first fully-connected network in the student model after the last iteration training can be used as an image recognition model, so that the image recognition model has a high-performance image feature extraction function, and the image recognition precision is improved.
In some embodiments, the teacher model includes a second feature extraction module and a second classification module; the second feature extraction module comprises a second backbone network and a third fully-connected network, and the second classification module comprises a fourth fully-connected network; the second backbone network comprises a plurality of convolution layers and residual layers; the output end of the second backbone network is connected with the input end of the third fully-connected network, and the output end of the third fully-connected network is connected with the input end of the fourth fully-connected network; the second feature extraction module is used for extracting features, and the second classification module is used for classifying images; the type of the basic network element of the second feature extraction module is the same as the type of the basic network element of the first feature extraction module; the number of basic network elements of the second feature extraction module is greater than the number of basic network elements of the first feature extraction module, and/or the number of nodes of the basic network elements of the second feature extraction module is greater than the number of nodes of the basic network elements of the first feature extraction module.
The second feature extraction module is used for extracting features, namely a first feature matrix used for outputting a sample image in the model training process; the second classification module is used for classifying the images, namely, outputting a first classification result of the sample images in the model training process.
The teacher model may be constructed from the second feature extraction module and the second classification module, and its specific structure may be set according to actual requirements without being limited to a particular teacher model structure, for example a network model formed by modifying ResNet50 (a deep residual network with a network depth of 50 layers). ResNet50 uses residual structures as its basic building blocks.
Optionally, the second backbone network is formed by a stack of residual blocks of multiple layers, each residual block being formed by one or more convolutional layers in jumping connection with the residual network.
The number of convolution layers contained in the second backbone network may be set according to actual needs; for example, the second backbone network includes 5 convolution modules, i.e., conv1, conv2_x, conv3_x, conv4_x, and conv5_x, wherein conv1 is one convolution layer with a 3x3x3 convolution kernel and a convolution stride of 1; conv2_x comprises multiple convolution layers (e.g., three) formed by stacking three-layer convolution units whose convolution kernels are 1x1x64, 3x3x64, and 1x1x256; conv3_x comprises multiple convolution layers formed by stacking three-layer convolution units whose convolution kernels are 1x1x128, 3x3x128, and 1x1x512; conv4_x comprises multiple convolution layers formed by stacking three-layer convolution units whose convolution kernels are 1x1x256, 3x3x256, and 1x1x1024; and conv5_x comprises multiple convolution layers formed by stacking three-layer convolution units whose convolution kernels are 1x1x512, 3x3x512, and 1x1x2048.
The input of the third full-connection layer is the output characteristic of conv5_x, and the output dimension of the third full-connection layer can be set according to actual requirements, such as 512-dimensional characteristics. The input of the fourth full-connection layer is the output characteristic (such as 512-dimensional characteristic) of the third full-connection layer, and the dimension of the output characteristic of the fourth full-connection layer is the same as the number of categories in the training data set, that is, one dimension corresponds to one category.
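Analogously to the student sketch above, an assumed rendering of the teacher model on the stock torchvision ResNet50 (again differing in backbone details from the modified network described here) might look like this:

```python
import torch
from torch import nn
from torchvision.models import resnet50

class TeacherModel(nn.Module):
    """ResNet50-style backbone + 512-d feature layer + classification layer (a sketch)."""

    def __init__(self, num_classes: int, feat_dim: int = 512):
        super().__init__()
        backbone = resnet50(weights=None)  # bottleneck residual stages (conv2_x .. conv5_x)
        backbone.fc = nn.Identity()        # expose the pooled 2048-d conv5_x features
        self.backbone = backbone
        self.fc3 = nn.Linear(2048, feat_dim)         # third fully-connected network
        self.fc4 = nn.Linear(feat_dim, num_classes)  # fourth fully-connected network

    def forward(self, x: torch.Tensor):
        feat = self.fc3(self.backbone(x))  # first feature matrix of the sample image
        logits = self.fc4(feat)            # first classification result
        return feat, logits

teacher = TeacherModel(num_classes=1000)
```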
It should be noted that the initial teacher model is obtained based on one or more input network model description files (containing the weight parameters of the network model and the description of the network model): the input network model description file is read, the input neural network model is constructed according to the description file, and the weight parameters of the constructed neural network model are initialized; the weight parameters are parameters obtained through a model pre-training method or randomly initialized parameters, for example by pre-training the neural network model on a general image classification data set to obtain initial weight parameters of the backbone network.
In the manner provided by the embodiment, a teacher model is formed by constructing a plurality of convolution layers, residual layers and full-connection networks, so that an image recognition model capable of accurately recognizing an image is efficiently and accurately obtained through the teacher model.
In some embodiments, the training data set is divided to obtain training data subsets corresponding to a plurality of batches of training, including: preprocessing each sample image in the training data set; randomly adjusting the arrangement sequence of each preprocessed sample image; dividing the adjusted training data set into training data subsets corresponding to a plurality of batches of training; the preprocessing includes one or more of image enhancement, image scaling, image channel transformation, image alignment, and normalization.
Optionally, one or a combination of image translation, image flipping, image rotation, and illumination processing may be applied to the sample images as image enhancement to expand the data set.
For example, the sample image is flipped horizontally with 50% probability to enhance the sample data volume.
The sample images have the same classification labels before and after data enhancement.
In addition, in order to enable the sample image to adapt to the input dimension change of the teacher model and the student model, the sample image can be subjected to image scale transformation and image channel transformation.
For example, the input images of the teacher model and the student model are three-channel images of a fixed size, such as an image width of 112 pixels, an image height of 112 pixels, and 3 channels; when determining the model inputs, the number of channels and the size of the sample images in the training data set need to be adjusted adaptively according to the input dimensions of the teacher model and the student model. Where the number of channels of a sample image differs, such as a single-channel grayscale image or an image in another format (such as a color-coded image), the number of channels of the sample image needs to be adjusted to match the channels of the input images of the teacher model and the student model.
In addition, to ensure normalization of the training dataset, image alignment, and normalization processing may also be performed on the sample images.
The normalization here may normalize the pixel values of the sample image to [-1, +1], for example by subtracting 127.5 from each pixel value of the sample image and dividing by 127.5.
And then, randomly adjusting the arrangement sequence of each preprocessed sample image, namely, disturbing the original arrangement sequence of the sample images so as to avoid the influence of the sequence of the sample images in the data set on the performance of the model, and further improving the generalization performance of the model.
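An assumed torchvision pipeline for the preprocessing described above (50% horizontal flip, scaling to the 112x112 three-channel input, and normalization to [-1, +1]) might look like this:

```python
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),               # image enhancement
    transforms.Resize((112, 112)),                        # image scale transformation
    transforms.ToTensor(),                                # scales pixel values to [0, 1]
    transforms.Normalize(mean=[0.5] * 3, std=[0.5] * 3),  # maps [0, 1] to [-1, +1],
])                                                        # equivalent to (x - 127.5) / 127.5
# Single-channel images would additionally need a channel transformation, e.g.
# transforms.Grayscale(num_output_channels=3), to match the 3-channel model input;
# shuffling the sample order is typically handled by the data loader.
```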
In some embodiments, the performing image recognition on the image to be recognized according to the feature matrix to obtain a recognition result of the image to be recognized includes: matching the feature matrix of the image to be identified with the feature matrix of each reference image in the image library to obtain a reference image matched with the image to be identified; and acquiring object attribute information corresponding to the image to be identified according to the object attribute information corresponding to the matched reference image.
Optionally, the image library includes a plurality of reference images corresponding to a plurality of object numbers, and each reference image has a feature matrix; a one-to-one mapping relationship is pre-established between each object number and its object attributes. The feature matrix of a reference image may be the feature matrix of that reference image obtained through the image recognition model trained in the training step.
The feature matrix of the image to be identified is matched against the feature matrix of each reference image in the image library to obtain the similarity between them; the reference image with the highest similarity is selected as the reference image matched with the image to be identified, and the object attribute information corresponding to that reference image is taken as the object attribute information of the image to be identified.
The similarity may be calculated, for example, as the cosine distance between the feature matrix of the image to be identified and the feature matrix of each reference image in the image library, from which the similarity between the two is determined.
When the image to be identified includes a plurality of images corresponding to the same object identifier, the feature matrices of the plurality of images may be averaged, the cosine distance between the feature matrix average and the feature matrix of each reference image in the image library calculated, and the similarity between the feature matrix of the image to be identified and the feature matrix of each reference image determined accordingly.
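A minimal sketch of this matching step follows (the gallery size, feature dimension, and variable names are assumptions):

```python
import torch
import torch.nn.functional as F

gallery = F.normalize(torch.randn(1000, 512), dim=1)  # reference-image feature matrices
probe_feats = torch.randn(3, 512)                     # several images of one object identifier

probe = F.normalize(probe_feats.mean(dim=0), dim=0)   # average the features, then normalize
similarity = gallery @ probe                          # cosine similarity to each reference
best = int(similarity.argmax())                       # index of the matched reference image
# Object attribute information is then looked up via the matched reference's object number.
```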
The image recognition model obtained through distillation training can be directly applied to various image classification tasks, can be simply expanded and applied to other computer vision tasks, and has high adaptability.
As shown in fig. 3, the present embodiment further provides an image recognition model training method, where the model training method is used for training an image recognition model in the image recognition method described in any one of the foregoing, and includes: step 301, collecting sample images of various categories, a first feature matrix and a first classification result of each sample image output by a teacher model in a training process, and a second feature matrix and a second classification result of each sample image output by a student model in the training process; step 302, performing distillation training to obtain the image recognition model based on the sample images of multiple categories, the first feature matrix and the first classification result of each sample image output by the teacher model in the training process, and the second feature matrix and the second classification result of each sample image output by the student model in the training process.
Optionally, the image recognition model is obtained by distillation learning training based on the following steps:
First, in order to improve image recognition accuracy, a plurality of sample images of different categories need to be acquired, with each category covering sample images from different scenes, so that the training data set has sufficient depth and breadth and the image recognition model obtained through distillation training can perform image recognition in different scenes.
For face recognition scenes, depth means that the more face images correspond to each object identifier the better, and that the face images corresponding to each object identifier come from a plurality of different scenes; the various scenes include scenes with different illumination changes, occlusion changes, appendage changes, facial pose changes, and age changes. Breadth means that the more sample images of objects of different categories the training set contains the better, such as objects of different ethnicities, different genders, and different skin tones.
In addition, a teacher model and a student model usable for image recognition are built; when constructing a training framework based on knowledge distillation technology, the feature expression capability of the teacher model should be higher than that of the student model. Generally, a network with a larger amount of computation has stronger feature expression capability and higher performance; to achieve this, the amount of computation of the teacher model should be larger than that of the student model, that is, the model complexity of the teacher model is higher than that of the student model. Moreover, to help the student model better imitate the output or intermediate representations of the teacher model, the teacher model and the student model may be configured with the same or similar basic network structures, such as both having convolution layers, residual layers, and fully-connected layers.
Then, the teacher model and the student model are trained on the sample images of multiple categories so as to dynamically collect the first feature matrix and the first classification result of each sample image output by the teacher model during training and the second feature matrix and the second classification result of each sample image output by the student model during training, and one or more loss functions are constructed based on these so as to perform distillation training on the student model, thereby obtaining a lightweight image recognition model that inherits the knowledge of the teacher model.
According to the image recognition model training method, the feature matrix and the class result of each sample image output by the teacher model in the training process and the feature matrix and the class result of each sample image output by the student model in the training process are collected to carry out distillation training, so that the image recognition model integrating multiple classes and multiple scenes is obtained, and the image recognition is carried out based on the image recognition model, so that the human face recognition precision can be improved to the greatest extent while the model scale is compressed to the greatest extent and the operation amount is reduced.
It should be noted that, the image recognition model training method and the image recognition method described above may be correspondingly referred to each other, that is, the execution steps of the image recognition method may be referred to for model training and model application, which will not be described herein.
The image recognition apparatus provided by the present invention will be described below, and the image recognition apparatus described below and the image recognition method described above may be referred to correspondingly to each other.
As shown in fig. 4, the present embodiment provides an image recognition apparatus including: the acquiring unit 401 is configured to acquire an image to be identified; the feature extraction unit 402 is configured to input the image to be identified into an image identification model, and obtain a feature matrix of the image to be identified output by the image identification model; the recognition unit 403 is configured to perform image recognition on the image to be recognized according to the feature matrix, so as to obtain a recognition result of the image to be recognized; the image recognition model is obtained by distillation training based on various types of sample images, a first feature matrix and a first classification result of each sample image output by a teacher model in a training process, and a second feature matrix and a second classification result of each sample image output by a student model in the training process.
According to the image recognition device provided by the embodiment, the feature matrix and the class result of each sample image output by the teacher model in the training process and the feature matrix and the class result of each sample image output by the student model in the training process are collected for distillation training, so that the image recognition model integrating multiple classes and multiple scenes is obtained, and the image recognition is performed based on the image recognition model, so that the face recognition precision can be improved to the greatest extent while the model scale is compressed to the greatest extent and the operand is reduced.
As shown in fig. 5, the present embodiment provides an image recognition model training apparatus, which includes: the collecting unit 501 is configured to collect multiple types of sample images, a first feature matrix and a first classification result of each sample image output by a teacher model in a training process, and a second feature matrix and a second classification result of each sample image output by a student model in a training process; the training unit 502 is configured to perform distillation training to obtain the image recognition model based on the sample images of multiple categories, a first feature matrix and a first classification result of each sample image output by the teacher model in the training process, and a second feature matrix and a second classification result of each sample image output by the student model in the training process.
According to the image recognition model training device, the feature matrix and the class result of each sample image output by the teacher model in the training process and the feature matrix and the class result of each sample image output by the student model in the training process are collected for distillation training, so that the image recognition model integrating multiple classes and multiple scenes is obtained, and the image recognition is performed based on the image recognition model, so that the human face recognition precision can be improved to the greatest extent while the model scale is compressed to the greatest extent and the operation amount is reduced.
The apparatus provided in the embodiments of the present invention is used to execute the above embodiments of the method, and specific flow and details refer to the above embodiments, which are not repeated herein.
Fig. 6 illustrates a physical schematic diagram of an electronic device, as shown in fig. 6, which may include: processor 601, communication interface (Communications Interface) 602, memory 603 and communication bus 604, wherein processor 601, communication interface 602, memory 603 complete the communication between each other through communication bus 604. The processor 601 may invoke logic instructions in the memory 603 to perform the image recognition method provided by the methods described above, the method comprising: acquiring an image to be identified; inputting the image to be identified into the image identification model to obtain a feature matrix of the image to be identified output by the image identification model; performing image recognition on the image to be recognized according to the feature matrix to obtain a recognition result of the image to be recognized; the image recognition model is obtained by distillation training based on various types of sample images, a first feature matrix and a first classification result of each sample image output by a teacher model in the training process, and a second feature matrix and a second classification result of each sample image output by a student model in the training process; or, executing the image recognition model training method provided by the methods, wherein the method comprises the following steps: collecting sample images of various categories, a first feature matrix and a first classification result of each sample image output by a teacher model in a training process, and a second feature matrix and a second classification result of each sample image output by a student model in the training process; and performing distillation training based on the sample images of various categories, the first feature matrix and the first classification result of each sample image output by the teacher model in the training process, and the second feature matrix and the second classification result of each sample image output by the student model in the training process to obtain an image recognition model.
Further, the logic instructions in the memory 603 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially, or in the part contributing to the prior art, or in part, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method of the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.
In another aspect, the present invention also provides a computer program product, the computer program product comprising a computer program, the computer program being storable on a non-transitory computer readable storage medium, the computer program, when executed by a processor, being capable of performing the image recognition method provided by the methods described above, the method comprising: acquiring an image to be identified; inputting the image to be identified into the image identification model to obtain a feature matrix of the image to be identified output by the image identification model; performing image recognition on the image to be recognized according to the feature matrix to obtain a recognition result of the image to be recognized; the image recognition model is obtained by distillation training based on various types of sample images, a first feature matrix and a first classification result of each sample image output by a teacher model in the training process, and a second feature matrix and a second classification result of each sample image output by a student model in the training process; or, executing the image recognition model training method provided by the methods, wherein the method comprises the following steps: collecting sample images of various categories, a first feature matrix and a first classification result of each sample image output by a teacher model in a training process, and a second feature matrix and a second classification result of each sample image output by a student model in the training process; and performing distillation training based on the sample images of various categories, the first feature matrix and the first classification result of each sample image output by the teacher model in the training process, and the second feature matrix and the second classification result of each sample image output by the student model in the training process to obtain an image recognition model.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the image recognition method provided by the methods described above, the method comprising: acquiring an image to be identified; inputting the image to be identified into an image recognition model to obtain a feature matrix of the image to be identified output by the image recognition model; and performing image recognition on the image to be identified according to the feature matrix to obtain a recognition result of the image to be identified; where the image recognition model is obtained by distillation training based on sample images of various categories, a first feature matrix and a first classification result of each sample image output by a teacher model in the training process, and a second feature matrix and a second classification result of each sample image output by a student model in the training process. Alternatively, the computer program, when executed, performs the image recognition model training method provided by the methods described above, the method comprising: collecting sample images of various categories, a first feature matrix and a first classification result of each sample image output by a teacher model in a training process, and a second feature matrix and a second classification result of each sample image output by a student model in the training process; and performing distillation training based on the foregoing sample images, feature matrices, and classification results to obtain the image recognition model.
The apparatus embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or, of course, by means of hardware. Based on this understanding, the foregoing technical solutions, in essence or the part contributing to the prior art, may be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions to cause a computer device (which may be a personal computer, a server, a network device, etc.) to perform the methods described in the various embodiments or in parts of the embodiments.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (21)

1. An image recognition method, comprising:
acquiring an image to be identified;
inputting the image to be identified into an image recognition model to obtain a feature matrix of the image to be identified, which is output by the image recognition model;
performing image recognition on the image to be identified according to the feature matrix to obtain a recognition result of the image to be identified;
the image recognition model is obtained by distillation training based on various types of sample images, a first feature matrix and a first classification result of each sample image output by a teacher model in a training process, and a second feature matrix and a second classification result of each sample image output by a student model in the training process;
The image recognition model is trained based on the following steps:
constructing a training data set according to the sample images of various categories and the category labels of each sample image;
dividing the training data set for the current iterative training to obtain training data subsets corresponding to a plurality of batches of training;
performing the following steps for each batch of training:
respectively inputting a training data subset corresponding to the current batch of training into a student model after the previous batch of training and a teacher model after the previous batch of training to obtain the first classification result, the second classification result, the first feature matrix and the second feature matrix of each sample image in the training data subset corresponding to the current batch of training;
acquiring a classification loss value according to the first classification result, the second classification result and the class label of each sample image in the training data subset corresponding to the current batch training;
acquiring a feature loss value according to the first feature matrix and the second feature matrix of each sample image in the training data subset corresponding to the current batch training;
performing iterative training on the student model after the previous batch training and the teacher model after the previous batch training according to the classification loss value and the feature loss value;
constructing the image recognition model according to the student model trained in the last iteration;
wherein the feature loss value includes a positive multiple loss value and/or a negative multiple loss value, and a distillation loss value;
and performing iterative training on the student model after the previous batch training and the teacher model after the previous batch training according to the classification loss value and the feature loss value, including:
performing weighted addition on the positive multiple loss value and/or the negative multiple loss value, the distillation loss value, and the classification loss value to obtain a target loss value;
performing iterative training on the student model after the previous batch training and the teacher model after the previous batch training based on the target loss value;
wherein the distillation loss value is used to compress the distance between the second feature matrix of each sample image output by the student model and the first feature matrix of the same sample image output by the teacher model; the positive multiple loss value is used to compress the distance between the second feature matrix of each sample image output by the student model and the first feature matrices, output by the teacher model, in the dynamic feature set of the category to which that sample image belongs; and the negative multiple loss value is used to expand the distance between the second feature matrix of each sample image output by the student model and the first feature matrices of other sample images output by the teacher model.
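Purely as an illustrative sketch of the weighted addition described above (the weight values are hypothetical hyperparameters; the claim does not fix them, and the and/or structure is preserved by making the multiple loss terms optional):

    import torch

    def target_loss(cls_loss: torch.Tensor, distill_loss: torch.Tensor,
                    pos_multi_loss: torch.Tensor = None, neg_multi_loss: torch.Tensor = None,
                    w_cls: float = 1.0, w_distill: float = 1.0,
                    w_pos: float = 1.0, w_neg: float = 1.0) -> torch.Tensor:
        # Weighted addition of the classification, distillation, and (optional)
        # positive/negative multiple loss values to obtain the target loss value.
        total = w_cls * cls_loss + w_distill * distill_loss
        if pos_multi_loss is not None:
            total = total + w_pos * pos_multi_loss
        if neg_multi_loss is not None:
            total = total + w_neg * neg_multi_loss
        return total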
2. The image recognition method of claim 1, wherein the classification loss value comprises a first classification loss value and a second classification loss value;
the obtaining a classification loss value according to the first classification result, the second classification result and the class label of each sample image in the training data subset corresponding to the current batch training includes:
acquiring the first classification loss value according to the first classification result and the class label of each sample image in the training data subset corresponding to the current batch training;
and acquiring the second classification loss value according to the second classification result and the classification label of each sample image in the training data subset corresponding to the current batch training.
3. The method of claim 2, wherein the obtaining the first classification loss value according to the first classification result and the class label of each sample image in the training data subset corresponding to the current lot training comprises:
calculating, based on a preset loss function, a loss value between the first classification result and the class label of each sample image in the training data subset corresponding to the current batch training, to obtain the first classification loss value;
wherein the preset loss function comprises an additive angular margin loss function, a cross-entropy loss function, or a large margin cosine loss function.
4. The method of claim 2, wherein the obtaining the second classification loss value according to the second classification result and the class label of each sample image in the training data subset corresponding to the current batch training comprises:
calculating, based on a preset loss function, a loss value between the second classification result and the class label of each sample image in the training data subset corresponding to the current batch training, to obtain the second classification loss value;
wherein the preset loss function comprises an additive angular margin loss function, a cross-entropy loss function, or a large margin cosine loss function.
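For illustration only, a cross-entropy instance of the preset loss function in claims 3 and 4 is sketched below, together with an additive-angular-margin variant; the margin m and scale s are common defaults, not values specified by the claims:

    import torch
    import torch.nn.functional as F

    def classification_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cross-entropy option of the preset loss function.
        return F.cross_entropy(logits, labels)

    def additive_angular_margin_logits(cos_theta: torch.Tensor, labels: torch.Tensor,
                                       m: float = 0.5, s: float = 64.0) -> torch.Tensor:
        # Additive-angular-margin option: add margin m to the target-class angle,
        # rescale, then feed the result to cross_entropy as usual.
        theta = torch.acos(cos_theta.clamp(-1 + 1e-7, 1 - 1e-7))
        target = F.one_hot(labels, cos_theta.size(1)).bool()
        return s * torch.cos(torch.where(target, theta + m, theta))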
5. The image recognition method according to claim 1, wherein the distillation loss value is calculated based on the steps of:
calculating the similarity distance between the first feature matrix and the second feature matrix of each sample image in the training data subset corresponding to the current batch training, to obtain a first similarity distance;
and determining the distillation loss value according to the first similarity distance.
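One plausible instance of the claim above, using cosine distance as the similarity distance (the claim does not fix a particular distance), is:

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_feats: torch.Tensor, teacher_feats: torch.Tensor) -> torch.Tensor:
        # Average cosine distance between each student (second) feature matrix
        # and the corresponding teacher (first) feature matrix in the batch.
        s = F.normalize(student_feats, dim=1)
        t = F.normalize(teacher_feats, dim=1)
        return (1.0 - (s * t).sum(dim=1)).mean()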
6. The image recognition method according to claim 1, wherein the positive multiple loss value is calculated based on the steps of:
and executing the following operations on each sample image in the training data subset corresponding to the current batch of training:
acquiring a target feature set corresponding to the category to which the current sample image belongs, according to the training result of the current batch training, or according to the training results of both the current batch training and at least one historical batch training preceding the current batch training;
updating the target feature set according to a first feature matrix of the current sample image;
calculating the similarity distance between a second feature matrix of the current sample image and each feature matrix in the updated target feature set to obtain a second similarity distance;
and determining the positive multiple loss value according to the second similarity distance.
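The final two steps of claim 6 might be sketched as follows, assuming cosine distance and a simple average over the updated target feature set (both are assumptions; the claim only requires the loss to be determined from the second similarity distances):

    import torch
    import torch.nn.functional as F

    def positive_multiple_loss(student_feat: torch.Tensor,
                               target_feature_set: list) -> torch.Tensor:
        # Distance between the current sample's student feature matrix and each
        # teacher feature matrix in the updated target feature set of its category.
        s = F.normalize(student_feat, dim=0)
        feats = F.normalize(torch.stack(target_feature_set), dim=1)
        return (1.0 - feats @ s).mean()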
7. The image recognition method according to claim 6, wherein the acquiring a target feature set corresponding to the category to which the current sample image belongs, according to the training result of the current batch training, or according to the training results of both the current batch training and at least one historical batch training preceding the current batch training, includes:
acquiring first feature matrices of first other sample images output by the teacher model after the previous batch training, according to the training result of the current batch training;
determining the target feature set according to the first feature matrices of the first other sample images output by the teacher model after the previous batch training; or,
acquiring, according to the training result of the at least one historical batch training, first feature matrices of the first other sample images and/or a first feature matrix of the current sample image output by the teacher model corresponding to the at least one historical batch training;
determining the target feature set according to the first feature matrices of the first other sample images output by the teacher model after the previous batch training, together with the first feature matrices of the first other sample images and/or the first feature matrix of the current sample image output by the teacher model corresponding to the at least one historical batch training;
wherein the first other sample images are sample images other than the current sample image in the sample image set corresponding to the category to which the current sample image belongs.
8. The image recognition method according to claim 6, wherein the updating the target feature set according to the first feature matrix of the current sample image comprises:
in a case where the number of feature matrices in the target feature set is smaller than a preset value, directly adding the first feature matrix of the current sample image to the target feature set;
in a case where the number of feature matrices in the target feature set is equal to the preset value, determining a target feature matrix according to the update time of each feature matrix in the target feature set;
and deleting the target feature matrix from the target feature set, and adding the first feature matrix of the current sample image to the target feature set after the deletion.
9. The image recognition method according to claim 8, wherein the determining the target feature matrix according to the update time of each feature matrix in the target feature set comprises:
sorting the update times of the feature matrices in the target feature set in ascending order;
and taking the feature matrix whose update time ranks first in the sorted order, i.e., the earliest-updated feature matrix, as the target feature matrix.
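Claims 8 and 9 together describe a fixed-size, first-in-first-out update of the dynamic feature set. A sketch using a deque, with max_size standing in for the claimed preset value, could read:

    from collections import deque

    def update_target_feature_set(feature_set: deque, new_feat, max_size: int = 8) -> None:
        # If the set is full, drop the feature matrix with the earliest update
        # time (first in ascending order), then append the new feature matrix.
        if len(feature_set) >= max_size:
            feature_set.popleft()
        feature_set.append(new_feat)

Because the append order coincides with update time here, the leftmost element is always the earliest-updated one; deque(maxlen=max_size) would achieve the same eviction implicitly.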
10. The image recognition method according to claim 1, wherein the negative multiple loss value is calculated based on the steps of:
executing the following operation on each sample image in the training data subset corresponding to the current batch training: calculating the similarity distance between the second feature matrix of the current sample image obtained by the current batch training and the first feature matrices of second other sample images, to obtain a third similarity distance;
determining the negative multiple loss value according to the third similarity distance;
wherein the second other sample images are sample images other than the current sample image in the training data subset corresponding to the current batch training.
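A sketch of claim 10, again assuming cosine similarity and clamping at zero (assumptions not fixed by the claim), so that minimizing the loss expands the claimed distances:

    import torch
    import torch.nn.functional as F

    def negative_multiple_loss(student_feat: torch.Tensor,
                               other_teacher_feats: list) -> torch.Tensor:
        # Penalize similarity between the current sample's student feature matrix
        # and the teacher feature matrices of the other samples in the subset.
        s = F.normalize(student_feat, dim=0)
        others = F.normalize(torch.stack(other_teacher_feats), dim=1)
        return F.relu(others @ s).mean()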
11. The image recognition method of any one of claims 1-10, wherein the student model comprises a first feature extraction module and a first classification module;
the first feature extraction module is used for extracting features, and the first classification module is used for classifying images.
12. The image recognition method according to claim 11, wherein the constructing the image recognition model according to the student model trained in the last iteration includes:
and constructing the image recognition model according to the first feature extraction module in the student model trained in the last iteration.
13. The image recognition method of claim 11, wherein the teacher model includes a second feature extraction module and a second classification module;
the second feature extraction module is used for extracting features, and the second classification module is used for classifying images;
the type of the basic network element of the second feature extraction module is the same as the type of the basic network element of the first feature extraction module; the number of basic network elements of the second feature extraction module is greater than the number of basic network elements of the first feature extraction module, and/or the number of nodes of the basic network elements of the second feature extraction module is greater than the number of nodes of the basic network elements of the first feature extraction module.
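The relationship in claim 13 can be illustrated with a toy Conv-BN-ReLU basic unit; the unit type, depths, and widths below are hypothetical choices:

    import torch.nn as nn

    def make_backbone(num_units: int, width: int) -> nn.Sequential:
        # One basic network unit type (Conv-BN-ReLU) shared by both models.
        layers, in_ch = [], 3
        for _ in range(num_units):
            layers += [nn.Conv2d(in_ch, width, kernel_size=3, padding=1),
                       nn.BatchNorm2d(width), nn.ReLU(inplace=True)]
            in_ch = width
        return nn.Sequential(*layers)

    student_backbone = make_backbone(num_units=4, width=64)    # fewer, narrower units
    teacher_backbone = make_backbone(num_units=16, width=256)  # more units, more nodes per unit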
14. The image recognition method according to any one of claims 1 to 10, wherein the performing iterative training on the student model after the previous batch training and the teacher model after the previous batch training according to the classification loss value and the feature loss value includes:
performing iterative optimization on the teacher model after the previous batch training by combining the classification loss value and the feature loss value, to obtain the teacher model after the current batch training;
performing iterative optimization on the student model after the previous batch training by combining the classification loss value and the feature loss value, to obtain the student model after the current batch training;
respectively inputting a training data subset of the next batch of training into the teacher model after the current batch of training and the student model after the current batch of training, and executing the iterative optimization step until all batches of training are completed, so as to obtain the teacher model after the current iterative training and the student model after the current iterative training;
continuing to divide the training data set, and performing iterative optimization on the teacher model after the current iterative training and the student model after the current iterative training based on a plurality of training data subsets corresponding to the batch training obtained by division until a preset termination condition is met; the preset termination condition includes reaching a maximum number of iterations.
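The control flow of claim 14, i.e., batch-level optimization of both models inside an outer loop that re-divides the training set until a preset maximum number of iterations, might be sketched as follows; every callable passed in is a hypothetical hook rather than an API defined by the patent:

    def run_training(student, teacher, s_opt, t_opt, make_batches, compute_loss,
                     training_set, max_epochs: int = 20) -> None:
        for _ in range(max_epochs):                        # preset termination condition
            for batch in make_batches(training_set):       # training data subsets per batch
                loss = compute_loss(batch, student, teacher)  # classification + feature losses
                s_opt.zero_grad()
                t_opt.zero_grad()
                loss.backward()
                s_opt.step()                               # student after the current batch
                t_opt.step()                               # teacher after the current batch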
15. The image recognition method according to any one of claims 1 to 10, wherein the dividing the training data set to obtain training data subsets corresponding to a plurality of batches of training includes:
preprocessing each sample image in the training data set;
randomly adjusting the arrangement sequence of each preprocessed sample image;
dividing the adjusted training data set into training data subsets corresponding to a plurality of batches of training;
the preprocessing includes one or more of image enhancement, image scaling, image channel transformation, image alignment, and normalization.
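Claim 15's division step, sketched with a hypothetical preprocess callable standing in for the enumerated preprocessing operations:

    import random

    def divide_into_batches(samples: list, batch_size: int, preprocess) -> list:
        # Preprocess each sample, randomly adjust the arrangement order, then
        # split the adjusted set into batch-sized training data subsets.
        processed = [preprocess(s) for s in samples]
        random.shuffle(processed)
        return [processed[i:i + batch_size]
                for i in range(0, len(processed), batch_size)]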
16. The image recognition method according to any one of claims 1 to 10, wherein the performing image recognition on the image to be identified according to the feature matrix to obtain a recognition result of the image to be identified includes:
matching the feature matrix of the image to be identified with the feature matrix of each reference image in the image library to obtain a reference image matched with the image to be identified;
and acquiring object attribute information corresponding to the image to be identified according to the object attribute information corresponding to the matched reference image.
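Claim 16's matching step, sketched with cosine similarity over the image library and a hypothetical acceptance threshold (neither is fixed by the claim):

    import torch
    import torch.nn.functional as F

    def match_to_library(query_feat: torch.Tensor, library_feats: torch.Tensor,
                         library_attrs: list, threshold: float = 0.4):
        # Score the query feature matrix against every reference feature and
        # return the object attribute information of the best-scoring reference.
        q = F.normalize(query_feat, dim=0)
        lib = F.normalize(library_feats, dim=1)
        scores = lib @ q
        best = int(torch.argmax(scores))
        return library_attrs[best] if float(scores[best]) >= threshold else None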
17. An image recognition model training method for training an image recognition model in the image recognition method according to any one of claims 1 to 16, comprising:
collecting sample images of various categories, a first feature matrix and a first classification result of each sample image output by a teacher model in a training process, and a second feature matrix and a second classification result of each sample image output by a student model in the training process;
and performing distillation training based on the sample images of various categories, a first feature matrix and a first classification result of each sample image output by the teacher model in the training process, and a second feature matrix and a second classification result of each sample image output by the student model in the training process to obtain the image recognition model.
18. An image recognition apparatus, comprising:
the acquisition unit is used for acquiring the image to be identified;
the feature extraction unit is used for inputting the image to be identified into an image recognition model to obtain a feature matrix of the image to be identified, which is output by the image recognition model;
the recognition unit is used for performing image recognition on the image to be identified according to the feature matrix to obtain a recognition result of the image to be identified;
the image recognition model is obtained by distillation training based on various types of sample images, a first feature matrix and a first classification result of each sample image output by a teacher model in a training process, and a second feature matrix and a second classification result of each sample image output by a student model in the training process;
The image recognition model is trained based on the following steps:
constructing a training data set according to the sample images of various categories and the category labels of each sample image;
dividing the training data set for the current iterative training to obtain training data subsets corresponding to a plurality of batches of training;
performing the following steps for each batch of training:
respectively inputting a training data subset corresponding to the current batch of training into a student model after the previous batch of training and a teacher model after the previous batch of training to obtain the first classification result, the second classification result, the first feature matrix and the second feature matrix of each sample image in the training data subset corresponding to the current batch of training;
acquiring a classification loss value according to the first classification result, the second classification result and the class label of each sample image in the training data subset corresponding to the current batch training;
acquiring a feature loss value according to the first feature matrix and the second feature matrix of each sample image in the training data subset corresponding to the current batch training;
performing iterative training on the student model after the previous batch training and the teacher model after the previous batch training according to the classification loss value and the feature loss value;
constructing the image recognition model according to the student model trained in the last iteration;
wherein the feature loss value includes a positive multiple loss value and/or a negative multiple loss value, and a distillation loss value;
and performing iterative training on the student model after the previous batch training and the teacher model after the previous batch training according to the classification loss value and the feature loss value, including:
performing weighted addition on the positive multiple loss value and/or the negative multiple loss value, the distillation loss value, and the classification loss value to obtain a target loss value;
performing iterative training on the student model after the previous batch training and the teacher model after the previous batch training based on the target loss value;
wherein the distillation loss value is used to compress the distance between the second feature matrix of each sample image output by the student model and the first feature matrix of the same sample image output by the teacher model; the positive multiple loss value is used to compress the distance between the second feature matrix of each sample image output by the student model and the first feature matrices, output by the teacher model, in the dynamic feature set of the category to which that sample image belongs; and the negative multiple loss value is used to expand the distance between the second feature matrix of each sample image output by the student model and the first feature matrices of other sample images output by the teacher model.
19. An image recognition model training apparatus for training an image recognition model in the image recognition method according to any one of claims 1 to 16, comprising:
an acquisition unit, used for collecting sample images of various categories, a first feature matrix and a first classification result of each sample image output by a teacher model in a training process, and a second feature matrix and a second classification result of each sample image output by a student model in the training process;
the training unit is used for carrying out distillation training to obtain the image recognition model based on the sample images of various categories, a first feature matrix and a first classification result of each sample image output by the teacher model in the training process, and a second feature matrix and a second classification result of each sample image output by the student model in the training process.
20. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the image recognition method of any one of claims 1 to 16 or the image recognition model training method of claim 17.
21. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the image recognition method according to any one of claims 1 to 16 or the image recognition model training method according to claim 17.
CN202311030276.XA 2023-08-16 2023-08-16 Image recognition method, training device, electronic equipment and storage medium Active CN116758618B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311030276.XA CN116758618B (en) 2023-08-16 2023-08-16 Image recognition method, training device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116758618A (en) 2023-09-15
CN116758618B (en) 2024-01-09

Family

ID=87953535

Country Status (1)

Country Link
CN (1) CN116758618B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113408570A (en) * 2021-05-08 2021-09-17 浙江智慧视频安防创新中心有限公司 Image category identification method and device based on model distillation, storage medium and terminal
CN115294407A (en) * 2022-09-30 2022-11-04 山东大学 Model compression method and system based on preview mechanism knowledge distillation

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11809523B2 (en) * 2021-02-18 2023-11-07 Irida Labs S.A. Annotating unlabeled images using convolutional neural networks



Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant