CN112016591A - Training method of image recognition model and image recognition method - Google Patents


Info

Publication number
CN112016591A
CN112016591A
Authority
CN
China
Prior art keywords
image
picture
neural network
convolutional neural
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010772704.6A
Other languages
Chinese (zh)
Inventor
陈嘉敏
王金桥
唐明
胡建国
招继恩
朱贵波
赵朝阳
林格
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nexwise Intelligence China Ltd
Original Assignee
Nexwise Intelligence China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nexwise Intelligence China Ltd filed Critical Nexwise Intelligence China Ltd
Priority to CN202010772704.6A priority Critical patent/CN112016591A/en
Publication of CN112016591A publication Critical patent/CN112016591A/en
Priority to PCT/CN2021/084760 priority patent/WO2022027987A1/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention provides a training method for an image recognition model and an image recognition method. The training method comprises the following steps: after a first image matrix of a sample picture is recorded, the sample picture is segmented and scrambled to obtain a second image matrix; picture features are extracted and picture classification results are obtained through the corresponding convolutional neural networks; a distillation loss function is solved from the picture features, and a classification loss function is solved from the picture classification results; the model is then optimized by optimizing the distillation loss function and the classification loss function, and training is finished when the distillation loss function is smaller than a preset first threshold and the classification loss function is smaller than a preset second threshold, so as to obtain the trained image recognition model. The embodiment of the invention facilitates the capture of local features and the extraction of more effective features, can reach the same accuracy as strongly supervised fine-grained recognition without any manual labeling information, reduces the time and space consumption of the algorithm on the model, and improves robustness.

Description

Training method of image recognition model and image recognition method
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a training method of an image recognition model and an image recognition method.
Background
Fine-grained recognition is also called fine recognition. Unlike general image analysis tasks, fine-grained image recognition works at a finer granularity: more subdivided subclasses must be distinguished within a single broad class, so that objects with only subtle differences can be told apart.
For example, general image classification only needs to distinguish broad object classes such as "bird" and "flower", while fine-grained image classification needs to distinguish the fine-grained subclasses under the "flower" category, e.g. to tell a Chinese rose from a rose. Fine-grained image recognition therefore requires finding subtle differences between different subclasses of the same broad class, which greatly increases its difficulty and challenge.
At present, fine-grained image recognition has wide application scenarios in daily life and industry, and as an image recognition technology it is an indispensable part of the artificial intelligence field. Because the granularity it distinguishes is finer, fine-grained image recognition can greatly improve existing recognition technology and help raise the precision of related higher-level technologies.
Existing fine-grained classification models can be divided into two categories according to the strength of the supervision information they use: classification models based on strong supervision information and classification models based on weak supervision information.
Classification models based on strong supervision information introduce two kinds of additional manual labeling information during training: target bounding boxes and key-part annotation points. With the target bounding boxes, a strongly supervised classification model can detect the foreground object and eliminate noise interference from the background; with the key-part annotation points, it can locate key points where targets differ markedly and efficiently extract local picture features at those points. Through the localization provided by these two kinds of additional information, a strongly supervised classification model can extract object information at precise locations, eliminate interference from irrelevant information in the picture background and from other objects, and thus achieve higher accuracy and better results.
In contrast, classification models based on weak supervision information use no additional manual labeling information and rely only on pictures and their class labels to complete the training and learning of the whole algorithm. Algorithms of this type require no large manual investment and are more convenient and simpler in practical application scenarios. In general, the accuracy of weakly supervised classification algorithms is inferior to that of strongly supervised ones. However, thanks to the development of deep learning in recent years, weakly supervised classification algorithms have introduced convolutional neural networks for training, greatly improving their accuracy, and they have gradually become the trend in fine-grained image recognition research.
The key point of a fine-grained recognition algorithm is how to dig out the nuances in a picture, i.e., the extraction of local features. Because discriminative features are hard to find, fine-grained recognition is a challenging task. A weakly supervised fine-grained recognition algorithm cannot use manual labeling information to accurately locate the target and its key points, and can only extract local features from the pictures themselves. A single picture yields many local features, and how to eliminate wrong, interfering features among them and learn the useful ones is a difficult problem. Existing local feature extraction generally uses an enumeration method, cropping component regions from the full image with different step sizes or scales and then extracting features for each region. However, this method is time-consuming and susceptible to interference from background information, and thus extracts a large number of region features that are useless for recognition. In addition, different illumination conditions and improper shooting angles also interfere with weakly supervised fine-grained recognition, lowering its accuracy and robustness. Achieving better robustness and a higher recognition rate for weakly supervised fine-grained recognition therefore remains challenging.
Disclosure of Invention
In order to solve the problems in the prior art, embodiments of the present invention provide a training method for an image recognition model and an image recognition method.
In a first aspect, an embodiment of the present invention provides a training method for an image recognition model, including: after a first image matrix of a sample picture is recorded, segmenting and scrambling the sample picture to obtain a second image matrix of the scrambled sample picture; inputting the first image matrix into a first convolutional neural network, extracting first picture features through the first convolutional neural network and obtaining a first picture classification result; inputting the second image matrix into a second convolutional neural network, extracting second picture features through the second convolutional neural network and obtaining a second picture classification result; solving a preset distillation loss function according to the first picture features and the second picture features, wherein the smaller the distillation loss function, the closer the first convolutional neural network and the second convolutional neural network are in their feature calculation processes; solving a preset classification loss function according to the first picture classification result and the second picture classification result, wherein the smaller the classification loss function, the closer the first convolutional neural network and the second convolutional neural network are to the true value in their classification results; and optimizing the first convolutional neural network and the second convolutional neural network by continuously optimizing the distillation loss function and the classification loss function, finishing training when the distillation loss function is smaller than a preset first threshold and the classification loss function is smaller than a preset second threshold, and thereby obtaining a trained image recognition model constructed from the first convolutional neural network and the second convolutional neural network.
Further, the segmenting and scrambling of the sample picture specifically includes: first dividing the image into a plurality of image blocks; then scrambling the image blocks in the row direction followed by scrambling the image blocks in the column direction, or scrambling the image blocks in the column direction followed by scrambling the image blocks in the row direction.
Further, scrambling the image blocks in the row direction includes: for each image block in each row, exchanging positions with the image block at the corresponding position in the row direction, within a preset first step range, according to the value of a first random variable. Scrambling the image blocks in the column direction includes: for each image block in each column, exchanging positions with the image block at the corresponding position in the column direction, within a preset second step range, according to the value of a second random variable.
Further, solving the preset distillation loss function according to the first picture features and the second picture features comprises: obtaining a global flow matrix from the first picture features extracted from two adjacent convolutional layers in the first convolutional neural network, and obtaining a local flow matrix from the second picture features extracted from two adjacent convolutional layers in the second convolutional neural network; and solving the preset distillation loss function by calculating the L2 norm distance between the global flow matrix and the local flow matrix.
Further, the expressions of the global stream matrix and the local stream matrix obtained through the picture features of the two adjacent layers are as follows:
G_{i,j}(x; W) = Σ_{s=1}^{h} Σ_{t=1}^{w} [ F1_{s,t,i}(x; W) × F2_{s,t,j}(x; W) ] / (h × w)

wherein F1 ∈ R^{h×w×m} represents the picture features of the upper layer c1 of the two adjacent layers, F2 ∈ R^{h×w×m} represents the picture features of the lower layer c2 of the two adjacent layers; h, w and m respectively represent the height, width and channel number of the picture features; s indexes the height dimension of the picture features and t indexes the width dimension; x represents the input picture; and W represents the weight parameters of the neural network.
Further, the distillation loss function is expressed by:
L_flow(W_global, W_local) = λ1 · (1/N) · Σ_x Σ_{l=1}^{n} || G^l_global(x) − G^l_local(x) ||₂²

wherein W_global represents the global flow matrix, W_local represents the local flow matrix, and L_flow(W_global, W_local) represents the distillation loss function derived from the global flow matrix and the local flow matrix; λ1 represents a weight coefficient; l represents the index of a flow matrix, the flow matrices comprising the global flow matrices and the local flow matrices; n represents the number of flow matrices for one picture, the number of global flow matrices being the same as the number of local flow matrices; x represents an input picture; N represents the number of pictures; G^l_global(x) represents the l-th global flow matrix of picture x; G^l_local(x) represents the l-th local flow matrix of picture x; and || · ||₂² represents the squared L2 norm distance.
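The flow-matrix and distillation-loss formulas above can be sketched in NumPy. This is a minimal illustration under the reading that the flow matrix is the spatial inner product of two adjacent layers' feature maps; the function names `flow_matrix` and `distillation_loss` are illustrative, not from the patent:

```python
import numpy as np

def flow_matrix(f1, f2):
    """Flow matrix between two adjacent layers' picture features.

    f1, f2: arrays of shape (h, w, m) for the upper and lower layer.
    Returns an (m, m) matrix: the sum over all h*w spatial positions of
    f1[s,t,i] * f2[s,t,j], divided by h*w, as in the patent's formula.
    """
    h, w, _ = f1.shape
    return np.einsum('sti,stj->ij', f1, f2) / (h * w)

def distillation_loss(global_flows, local_flows, lam=1.0):
    """Weighted squared-L2 distance between paired global and local flow
    matrices for one picture (N = 1 in the patent's formula)."""
    return lam * sum(np.sum((g - l) ** 2)
                     for g, l in zip(global_flows, local_flows))
```

When the two branches compute features in the same way, paired flow matrices coincide and the distillation loss is zero, which is the optimization target of this step.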
In a second aspect, an embodiment of the present invention provides an image recognition method based on the image recognition model, including: after a first image matrix of an input picture is recorded, the input picture is segmented and disordered, and a second image matrix of the disordered input picture is obtained; inputting the first image matrix into the first convolution neural network, and acquiring a first output vector of a full connection layer through the first convolution neural network; inputting the second image matrix into the second convolutional neural network, and acquiring a second output vector of the full-connection layer through the second convolutional neural network; and obtaining a picture identification result according to the first output vector and the second output vector.
Further, obtaining a picture identification result by the first output vector and the second output vector comprises: and adding the first output vector and the second output vector to obtain a third output vector, and obtaining the picture identification result according to the third output vector.
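The inference rule of the second aspect — add the two fully connected layers' output vectors and read off the class — can be sketched as follows (assuming the vectors are per-class scores; the function name is illustrative):

```python
import numpy as np

def recognize(output_global, output_local):
    """Add the first and second output vectors to obtain the third
    output vector, then return the index of its largest entry as the
    picture identification result."""
    combined = np.asarray(output_global) + np.asarray(output_local)
    return int(np.argmax(combined))
```

For example, with branch outputs [0.1, 2.0, 0.3] and [0.2, 0.1, 0.0], the combined vector is [0.3, 2.1, 0.3] and class 1 is predicted.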
In a third aspect, an embodiment of the present invention provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the steps of the method as provided in the first aspect or the second aspect when executing the computer program.
In a fourth aspect, embodiments of the present invention provide a non-transitory computer readable storage medium, on which a computer program is stored, which when executed by a processor, implements the steps of the method as provided in the first or second aspect.
According to the training method of the image recognition model and the image recognition method provided by the embodiments of the invention, during model training the image matrix of the original picture and the image matrix of the scrambled picture are input into two convolutional neural network branches respectively, and the features and classification results extracted by the two convolutional neural networks are combined for learning and training. This facilitates the capture and extraction of local features and yields more effective features; the same accuracy as strongly supervised fine-grained recognition can be achieved without any manual labeling information; the time and space consumption of the algorithm on the model can be reduced; and system robustness is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flowchart of a training method for an image recognition model according to an embodiment of the present invention;
FIG. 2 is a flowchart of a training method for an image recognition model according to another embodiment of the present invention;
FIG. 3 is a flowchart of an image recognition method according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an image recognition model training apparatus according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an image recognition apparatus according to an embodiment of the present invention;
fig. 6 illustrates a physical structure diagram of an electronic device.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a training method of an image recognition model according to an embodiment of the present invention. As shown in fig. 1, the method includes:
step 101, after recording a first image matrix of a sample picture, segmenting and scrambling the sample picture, thereby obtaining a second image matrix of the sample picture after scrambling.
A picture can be characterized by an image matrix, whose elements may be the gray values of the pixels. The image recognition model obtained by the training method provided by the embodiment of the invention can realize weakly supervised fine-grained image recognition.
Fine local detail feature representation is the key to fine-grained recognition. Local details matter more than the global structure here, because images from different fine-grained classes usually share the same global structure or shape and differ only in local details. Scrambling and recombining the pictures makes the algorithm discard global structure information while retaining local detail information, forcing the attention of the model network onto the distinctive local regions used for recognition. The picture scrambling step effectively destroys the global structure; to recognize these randomly scrambled images, the classification network must then find and learn the recognizable local regions. Such operations force the neural network to focus on the details in the picture.
The training method of the image recognition model provided by the embodiment of the invention combines the original picture and the scrambled picture for training. Therefore, before the sample picture is scrambled, its first image matrix needs to be stored in advance; the first image matrix is the image matrix of the sample picture before scrambling. The sample picture is then segmented and scrambled to obtain the second image matrix, i.e., the image matrix of the scrambled sample picture.
Step 102, inputting the first image matrix into a first convolutional neural network, extracting a first picture characteristic through the first convolutional neural network and obtaining a first picture classification result; and inputting the second image matrix into a second convolutional neural network, extracting second image characteristics through the second convolutional neural network and obtaining a second image classification result.
The embodiment of the invention adopts a convolutional neural network for learning and training, and comprises two convolutional neural networks, wherein the input of the first convolutional neural network is a first image matrix of an original picture, and the input of the second convolutional neural network is a second image matrix of a disordered picture.
Thus, the feature extraction part is divided into two branches: global feature extraction and local feature extraction. The two branches use the same infrastructure; for example, both can use ResNet-50 to extract features. The difference is that the global features are obtained by passing the original picture through the first convolutional neural network, which may also be called the convolutional neural network f_global, while the local features are obtained by passing the scrambled picture φ(I) through the second convolutional neural network, also called the convolutional neural network f_local. The extracted global features (first picture features) and local features (second picture features) then pass through a fully connected layer to obtain a global feature classification result (first picture classification result) and a local feature classification result (second picture classification result), respectively.
Step 103, solving a preset distillation loss function according to the first picture characteristic and the second picture characteristic, wherein the smaller the distillation loss function is, the closer the first convolutional neural network and the second convolutional neural network are in the characteristic calculation process is; and solving a preset classification loss function according to the first image classification result and the second image classification result, wherein the smaller the classification loss function is, the closer the first convolutional neural network and the second convolutional neural network are to a true value on the classification result is.
For the two feature streams obtained above (the first and second picture features), the knowledge distillation step is completed using the intermediate features of the layers in the two convolutional neural networks. Knowledge Distillation (KD) was first proposed by Hinton and is mostly used in convolutional neural networks; its core idea is knowledge transfer, i.e., extracting knowledge from a well-trained teacher network to train a student network, so that the student improves its recognition accuracy while keeping its model parameters small. However, this method has limitations: it is difficult to optimize a very deep neural network with it. In the spirit of teaching one to fish rather than giving one a fish, the embodiment of the invention provides a new knowledge distillation algorithm that does not directly learn the features of the teacher network but instead learns the teacher network's feature calculation process. This skips the depth constraint on the neural network model, achieves relatively good generality, and can effectively improve the recognition rate and performance of the model on fine-grained recognition tasks that are difficult for computer vision.
Therefore, in the embodiment of the present invention, a preset distillation loss function is solved according to the first picture feature and the second picture feature, and a smaller distillation loss function indicates that the first convolutional neural network and the second convolutional neural network are closer in feature calculation flow; and solving a preset classification loss function according to the first image classification result and the second image classification result, wherein the smaller the classification loss function is, the closer the first convolutional neural network and the second convolutional neural network are to a true value on the classification result is. Wherein the classification loss function may be expressed as a difference between a sum of output vectors of the first convolutional neural network and the second convolutional neural network and a true value.
For the input picture I and the scrambled picture φ(I), the global feature extraction convolutional neural network f_global and the local feature extraction convolutional neural network f_local yield the corresponding global feature output vector C(I) and local feature output vector C(φ(I)), respectively. Thus, the classification loss function can be defined as:

L_cls = − Σ_{I ∈ 𝔻} l · log( C(I) + C(φ(I)) )

where l represents the true classification label of the image, log represents the logarithmic function, and 𝔻 represents the collection of pictures.
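Under the reading that C(·) is a vector of per-class probabilities and l a one-hot true label, the classification loss for a single picture can be sketched in NumPy as follows (an illustration under that assumption, not the patent's exact implementation; `eps` is added only for numerical safety):

```python
import numpy as np

def classification_loss(probs_global, probs_local, one_hot_label, eps=1e-12):
    """Cross-entropy of the summed branch outputs against the true label,
    i.e. -sum( l * log( C(I) + C(phi(I)) ) ) for one picture I."""
    combined = np.asarray(probs_global) + np.asarray(probs_local)
    return float(-np.sum(np.asarray(one_hot_label) * np.log(combined + eps)))
```

When both branches place all their mass on the true class (the combined probability there reaches 1), the loss is zero; the further the combined output drifts from the true label, the larger the loss.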
And 104, optimizing the first convolutional neural network and the second convolutional neural network by continuously optimizing the distillation loss function and the classification loss function, and finishing training when the distillation loss function is smaller than a preset first threshold and the classification loss function is smaller than a preset second threshold, so as to obtain a trained image recognition model constructed by the first convolutional neural network and the second convolutional neural network.
The smaller the distillation loss function and the classification loss function are, the better optimized the model is. Both loss functions are continuously reduced through feedback to the neural networks, so that the model is gradually optimized. Training is finished when the distillation loss function is smaller than the preset first threshold and the classification loss function is smaller than the preset second threshold, yielding the trained image recognition model.
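The stopping rule of this step — keep optimizing until each loss drops below its threshold — can be sketched structurally. The optimization step itself is abstracted as a caller-supplied `step_fn` (a stand-in, not the patent's networks) that performs one update and reports the pair of losses:

```python
def train(step_fn, thresh_distill, thresh_cls, max_iters=10000):
    """Run optimization steps until the distillation loss is below the
    first threshold AND the classification loss is below the second.

    step_fn() performs one optimization step and returns the pair
    (distillation_loss, classification_loss).
    """
    for it in range(max_iters):
        d_loss, c_loss = step_fn()
        if d_loss < thresh_distill and c_loss < thresh_cls:
            return it + 1, d_loss, c_loss  # training finished
    raise RuntimeError("thresholds not reached within max_iters")
```

With decreasing losses, the loop terminates at the first iteration where both conditions hold simultaneously, matching the dual-threshold criterion described above.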
The training method provided by the embodiment of the invention is generally divided into two parts: a destruction-and-recombination part and a knowledge distillation part. The destruction-and-recombination part scrambles the pictures in an orderly manner, destroying the structural information in the pictures so that the algorithm extracts finer local information. The knowledge distillation part distills and concentrates the features extracted from the destroyed picture, extracting the features most effective for improving the model recognition rate and further improving the accuracy of the algorithm. The knowledge distillation part may include the process of model optimization using the distillation loss function and the classification loss function.
According to the embodiment of the invention, the image matrix of the original picture and the image matrix of the scrambled picture are input into two convolutional neural network branches respectively, and the features and classification results extracted by the two convolutional neural networks are combined for learning and training. This facilitates the capture and extraction of local features and yields more effective features; the same accuracy as strongly supervised fine-grained recognition can be achieved without any manual labeling information; the time and space consumption of the algorithm on the model can be reduced; and system robustness is improved.
Further, based on the above embodiment, segmenting and scrambling the sample picture specifically includes: first dividing the image into a plurality of image blocks; then scrambling the image blocks in the row direction followed by scrambling the image blocks in the column direction, or scrambling the image blocks in the column direction followed by scrambling the image blocks in the row direction.
When the sample picture is segmented and scrambled, it is first segmented and then scrambled. During segmentation, the image is divided into a plurality of image blocks, for example M × N image blocks. After segmentation, the image blocks are scrambled: either the row-direction scrambling is performed first and the column-direction scrambling second, or the column-direction scrambling is performed first and the row-direction scrambling second.
On the basis of the above embodiment, the embodiment of the invention scrambles the image blocks in the row and column directions in sequence after the picture is segmented, improving the flexibility and orderliness of the system.
Further, based on the above embodiment, the performing the operation of scrambling the image blocks in the row direction includes: for each image block in each row, exchanging positions with the image blocks at corresponding positions in the row direction within a preset first step length range according to the value of a first random variable; the performing of the operation of scrambling the image blocks in the column direction includes: and for each image block in each column, exchanging the position of the image block in the column direction with the image block in the corresponding position according to the value of a second random variable within a preset second step length range.
The core of the destruction-and-reconstruction idea provided by the embodiment of the present invention is how to destroy the picture effectively, so that the structural information of the picture is disrupted while its local information is highlighted. Dividing the sample picture into different image blocks is, in essence, dividing the first image matrix into different block matrices. Scrambling the picture in an orderly and controllable manner means swapping the block matrices of the picture within a controllable range, so that the noise introduced by the scrambling operation is controlled and the local features of the picture can still be highlighted.
Specifically, the moving step size of each image block may be limited. For example, the step size by which an image block moves in the row direction may be confined to the first step-length range. This step may be represented by a first random variable, which can take a different value each time an image block moves, but always within the first step-length range. Similarly, the step size in the column direction may be confined to the second step-length range and represented by a second random variable, which can likewise take a different value for each move but always within the second step-length range. Whenever an image block moves, it exchanges positions with the image block at the target location.
Of course, for a square picture, the picture may be sliced into N × N blocks, i.e., with the same number of blocks in the row and column directions. When moving, the image blocks may also use a uniform step size in the row and column directions. Taking this case as an example, the picture-scrambling method is further described below:
the picture scrambling step can be divided into two sub-operations: cutting and disturbing. Firstly, an input image is divided into local small blocks, and then a random algorithm is used for scrambling the small blocks, so that a scrambled picture can be obtained. The specific operation is as follows:
For an input image I, the image is first divided into N × N sub-regions R_{i,j}, where i and j are the corresponding row and column block numbers. The algorithm shuffles the cut sub-regions by the following mechanism: for the j-th row, the algorithm first generates a vector q_j of size N whose i-th element is

q_{j,i} = i + r, with r drawn uniformly from (-k, k),

where k is an adjustable parameter of the algorithm (1 ≤ k < N) that characterizes the range perturbed by the scrambling mechanism. Sorting q_j under this scrambling mechanism yields a new permutation σ_j^row of the sub-regions in the j-th row, and the variation range of each element satisfies:

∀ i ∈ {1, ..., N}: |σ_j^row(i) − i| < 2k

Through the above operation, the row scrambling of the picture is completed. Applying the same rule to the columns after row scrambling yields the analogous relationship:

∀ j ∈ {1, ..., N}: |σ_i^col(j) − j| < 2k

After the input picture is subjected to row scrambling and column scrambling, the scrambled picture φ(I) is obtained, in which the sub-region originally at position (i, j) is moved to position

σ(i, j) = (σ_j^row(i), σ_i^col(j)).
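The jittered scrambling mechanism above can be sketched in code as follows. This is an illustrative reconstruction under the stated rule (permutation by sorting q_{j,i} = i + r with r uniform on (-k, k), rows first, then columns); the function names are our own:

```python
import numpy as np

def jittered_permutation(n, k, rng):
    """Permutation obtained by sorting q_i = i + r, r ~ U(-k, k).

    Every element moves by strictly fewer than 2k positions.
    """
    q = np.arange(n) + rng.uniform(-k, k, size=n)
    return np.argsort(q)

def destroy_image(blocks, k, rng):
    """Scramble an (N, N, bh, bw) grid of blocks: each row, then each column."""
    n = blocks.shape[0]
    out = blocks.copy()
    for j in range(n):                 # row scrambling: permute within row j
        out[j] = out[j][jittered_permutation(n, k, rng)]
    for i in range(n):                 # column scrambling: permute within column i
        out[:, i] = out[:, i][jittered_permutation(n, k, rng)]
    return out
```

A smaller k keeps every block near its original position, which is how the mechanism bounds the noise introduced by the scrambling.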
The picture scrambling step effectively destroys the global structure; to identify these randomly scrambled images, the classification network must then find and learn identifiable local regions. This operation forces the neural network to focus on details in the picture, while the parameter k ensures that each local region jitters only within a neighboring area, thereby controlling the noise introduced by the scrambling operation and highlighting the local features of the picture.
On the basis of the above embodiments, the embodiments of the present invention scramble the image blocks in the row and column directions using random variables confined to a preset range, ensuring that each local region jitters only within a neighboring area while local features are highlighted, thereby controlling the noise introduced by the scrambling operation.
Further, according to the above embodiment, solving the preset distillation loss function according to the first picture feature and the second picture feature includes: obtaining a global flow matrix from the first picture features extracted at two adjacent convolutional layers of the first convolutional neural network, and obtaining a local flow matrix from the second picture features extracted at two adjacent convolutional layers of the second convolutional neural network; and solving the preset distillation loss function by calculating the L2-norm distance between the global flow matrix and the local flow matrix.
Specifically, when solving the preset distillation loss function according to the first picture feature and the second picture feature, a global flow matrix is obtained from the first picture features extracted at two adjacent convolutional layers of the first convolutional neural network; this matrix reflects how the features change between those two layers. A local flow matrix is likewise obtained from the second picture features extracted at two adjacent convolutional layers of the second convolutional neural network, reflecting how the features change between those two layers. The preset distillation loss function is then solved by calculating the L2-norm distance between the global flow matrix and the local flow matrix. This distance measures how close the feature changes of the two networks' adjacent layers are: the smaller the L2-norm distance, the smaller the distillation loss value, and the closer the feature changes of the two networks.
The new knowledge distillation algorithm provided by the embodiment of the present invention, also called the flow-matrix distillation method, obtains the change relationship of the features between the layers of the two networks by calculating their flow matrices; by drawing the two flow matrices toward each other and fusing them, the student network can learn the teacher network's "solution process" for computing features, thereby improving the accuracy of fine-grained recognition. In the algorithm flow provided by the embodiment of the present invention, the roles of teacher network and student network are not strictly divided; instead, the knowledge distillation effect is achieved by the mutual approach and fusion of the global feature extraction network (the first convolutional neural network) and the local feature extraction network (the second convolutional neural network).
By continuously optimizing the loss functions (including the distillation loss function and the classification loss function), the embodiment of the present invention can continuously fuse the global and local features extracted from the picture, distilling and refining them against each other. This process extracts the features that contribute most to the model recognition rate, better improves the accuracy of fine-grained recognition, and eliminates the noise caused by scrambling the pictures. Meanwhile, by learning the process of feature change between the two networks, the flow-matrix distillation method achieves good model generalization, overcomes the limitations of conventional knowledge distillation, and remains effective even for deep neural networks.
On the basis of the above embodiment, the flow-matrix distillation method learns the process of feature change between the two networks, giving it better model generalization, overcoming the limitations of knowledge distillation, and remaining effective even for deep neural networks.
Further, based on the above embodiment, the global flow matrix and the local flow matrix are obtained from the picture features of two adjacent layers as follows. Let F1 ∈ R^{h×w×m} denote the picture features of the upper layer c1 of the two adjacent layers and F2 ∈ R^{h×w×n} those of the lower layer c2, where h and w denote the height and width of the feature maps, m and n their channel numbers, s and t the channel indices of F1 and F2 respectively, x the input picture, and W the weight parameters of the neural network.
For the teacher network, the goal is to learn the process of feature change within its network, i.e., the relationship between the features obtained at two adjacent layers. The flow matrix G ∈ R^{m×n} is therefore defined element-wise as:

G_{s,t}(x; W) = (1 / (h · w)) · Σ_{i=1}^{h} Σ_{j=1}^{w} F1_{i,j,s}(x; W) · F2_{i,j,t}(x; W)
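A minimal sketch of the flow-matrix computation above, assuming the element-wise definition given (a channel-by-channel inner product averaged over spatial positions); shapes and names are illustrative:

```python
import numpy as np

def flow_matrix(f1, f2):
    """G[s, t] = sum_{i,j} f1[i, j, s] * f2[i, j, t] / (h * w).

    f1: (h, w, m) features of the upper layer; f2: (h, w, n) of the lower layer.
    Returns the (m, n) flow matrix relating the two layers' features.
    """
    h, w, _ = f1.shape
    return np.einsum('ijs,ijt->st', f1, f2) / (h * w)

f1 = np.random.rand(7, 7, 16)    # upper-layer feature map
f2 = np.random.rand(7, 7, 32)    # lower-layer feature map
g = flow_matrix(f1, f2)          # flow matrix of shape (16, 32)
```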
Knowledge distillation can thus be achieved by separately calculating the flow matrices of the first convolutional neural network and the second convolutional neural network and continuously optimizing the L2-norm distance between them.
On the basis of the above embodiments, the embodiments of the present invention improve the practicability by giving the expression of the flow matrix.
Further, based on the above embodiment, the distillation loss function is expressed as:

L_flow(W_global, W_local) = (1 / N) · Σ_x Σ_{l=1}^{n} λ_1 · ‖ G_global^l(x; W_global) − G_local^l(x; W_local) ‖_2^2

wherein W_global denotes the parameters of the global flow matrices and W_local those of the local flow matrices; L_flow(W_global, W_local) denotes the distillation loss function obtained from the global flow matrices and the local flow matrices; λ_1 denotes a weight coefficient; l denotes the sequence number of a flow matrix, the flow matrices including both the global and the local flow matrices; n denotes the number of flow matrices for one picture, the number of global flow matrices being the same as the number of local flow matrices; x denotes an input picture; N denotes the number of pictures; G_global^l(x; W_global) denotes the l-th global flow matrix of picture x; G_local^l(x; W_local) denotes the l-th local flow matrix of picture x; and ‖·‖_2^2 denotes the squared L2-norm distance.
First, the global flow matrices G_global(x; W_global) of the global feature extraction network and the local flow matrices G_local(x; W_local) of the local feature extraction network are calculated, and then the knowledge distillation loss function L_flow(W_global, W_local) is computed. Since one flow matrix is calculated from each pair of adjacent layers, a single picture corresponds to multiple flow matrices. The L2-norm distances of all flow matrices of each picture are aggregated to yield the distillation loss function above. In the embodiment of the present invention, each flow matrix is considered equally important, so the same weight coefficient λ_1 may be used throughout the loss function.
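Under the loss expression above (uniform weight λ_1, squared L2 distances summed over each picture's flow-matrix pairs and averaged over pictures), a sketch could look like the following; the data layout and names are assumptions of ours:

```python
import numpy as np

def flow_distillation_loss(global_flows, local_flows, lam=1.0):
    """global_flows and local_flows: one list of flow matrices per picture.

    L = (1 / N) * sum over pictures x and matrix indices l of
        lam * || G_global^l(x) - G_local^l(x) ||_2^2
    """
    n_pictures = len(global_flows)
    loss = 0.0
    for g_mats, l_mats in zip(global_flows, local_flows):
        for g_mat, l_mat in zip(g_mats, l_mats):
            loss += lam * np.sum((g_mat - l_mat) ** 2)
    return loss / n_pictures
```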
On the basis of the above embodiments, the embodiments of the present invention obtain the distillation loss function by synthesizing the L2 norm distances of the flow matrix of each picture, thereby improving the reliability of the distillation loss function.
Fig. 2 is a flowchart of a training method of an image recognition model according to another embodiment of the present invention. As shown in Fig. 2, the embodiment of the present invention provides a training method for an image recognition model based on destruction-and-reconstruction and knowledge distillation, which can achieve the same accuracy as strongly supervised fine-grained recognition without any manually labeled information, while reducing the time and space consumption of the algorithm at the model level. The method is generally divided into two parts: a destruction-and-reconstruction part and a knowledge distillation part. The destruction-and-reconstruction part realizes the ordered scrambling of the pictures, destroying their structural information and ensuring that the algorithm extracts more precise local information; the knowledge distillation part distills and concentrates the features extracted from the destroyed picture, extracting the features most effective for improving the model recognition rate and further improving the accuracy of the algorithm.
First, the algorithm performs the picture destruction step, scrambling the pictures in an orderly manner: the perturbation amplitude is controlled while the pictures are scrambled, so that the noise introduced by the scrambling is effectively controlled. Through this step, the original structural information of the picture is destroyed, forcing the algorithm to pay attention to local information points in the picture and to extract more effective and precise local information.
After the destruction-and-reconstruction part is completed, the algorithm enters the knowledge distillation part, which is carried out jointly by two branches. The local features of the scrambled picture and the global features of the original picture are each extracted by a convolutional neural network, and the local and global classification results are then obtained through fully connected layers. Meanwhile, the local and global flow matrices required by the algorithm are calculated from the layer outputs of the two convolutional neural networks. The knowledge distillation algorithm then distills and concentrates the extracted features to obtain those most effective for improving the model recognition rate, which facilitates the parameter adjustment of the convolutional neural networks. The algorithm can thus fuse the global and local features for fine-grained classification of the picture, effectively improving fine-grained recognition accuracy.
Fig. 3 is a flowchart of an image recognition method according to an embodiment of the present invention. The method can be used for image recognition by applying the image recognition model obtained by training in any embodiment. The method comprises the following steps:
step 201, after recording a first image matrix of an input picture, segmenting and scrambling the input picture, thereby obtaining a second image matrix of the scrambled input picture.
After the first image matrix of the input image is recorded, the input image can be segmented and scrambled according to the rules of image segmentation and scrambling during model training, so that the second image matrix of the scrambled input image is obtained. Different from the sample pictures during training, the first image matrix in the embodiment of the invention corresponds to the input pictures which need to be identified actually, and the second image matrix corresponds to the input pictures after disorder.
Step 202, inputting the first image matrix into the first convolutional neural network, and acquiring a first output vector of a fully connected layer through the first convolutional neural network; and inputting the second image matrix into the second convolutional neural network, and acquiring a second output vector of the fully connected layer through the second convolutional neural network.
The first image matrix is input into the first convolutional neural network, and a first output vector of the fully connected layer is acquired through it; the magnitude of each element in the first output vector can represent the probability that the picture belongs to the corresponding category. The second image matrix is input into the second convolutional neural network, and a second output vector of the fully connected layer is acquired through it; the magnitude of each element in the second output vector likewise represents the probability that the picture belongs to the corresponding category.
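Interpreting fully connected outputs as class probabilities is commonly done with a softmax; the sketch below is our illustration of that reading, not a detail stated in the embodiment:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D vector of FC-layer outputs."""
    z = logits - np.max(logits)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))   # sums to 1; largest logit wins
```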
And 203, obtaining a picture identification result according to the first output vector and the second output vector.
The picture identification result can be obtained by combining the first output vector and the second output vector. For example, the two vectors may be summed in a weighted manner, and the category of the picture determined according to the magnitudes of the elements in the resulting vector.
The embodiment of the present invention can realize weakly supervised fine-grained image recognition using the image recognition model obtained by the above training method, requires no manual annotation information, and can achieve the same accuracy as strongly supervised fine-grained recognition.
Further, based on the above embodiment, the obtaining a picture identification result according to the first output vector and the second output vector includes: and adding the first output vector and the second output vector to obtain a third output vector, and obtaining the picture identification result according to the third output vector.
When the picture identification result is obtained according to the first output vector and the second output vector, the two vectors can be directly added to obtain a third output vector, and the category of the picture determined according to the magnitudes of the elements in the third output vector, thereby obtaining the picture identification result.
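The fusion rule above (add the two branch outputs, then pick the class with the largest element) can be sketched as follows; the variable names are illustrative:

```python
import numpy as np

def predict(first_output, second_output):
    """Add the two branches' output vectors and take the arg-max class."""
    third_output = first_output + second_output
    return int(np.argmax(third_output))

global_scores = np.array([0.1, 0.7, 0.2])    # first (global) branch output
local_scores = np.array([0.3, 0.4, 0.1])     # second (local) branch output
pred = predict(global_scores, local_scores)  # class 1 (0.7 + 0.4 is largest)
```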
On the basis of the above embodiment, the embodiment of the present invention obtains the third output vector by adding the first output vector and the second output vector, and obtains the picture recognition result according to the third output vector, thereby improving the simplicity.
Fig. 4 is a schematic structural diagram of an image recognition model training apparatus according to an embodiment of the present invention. As shown in fig. 4, the apparatus includes a picture scrambling module 10, a feature extracting and classifying module 20, a loss function calculating module 30, and a model optimizing module 40, wherein: the picture scrambling module 10 is configured to: after a first image matrix of a sample picture is recorded, the sample picture is segmented and disordered, and a second image matrix of the disordered sample picture is obtained; the feature extraction and classification module 20 is configured to: inputting the first image matrix into a first convolution neural network, extracting first picture features through the first convolution neural network and obtaining a first picture classification result; inputting the second image matrix into a second convolutional neural network, extracting second image characteristics through the second convolutional neural network and obtaining a second image classification result; the loss function calculation module 30 is configured to: solving a preset distillation loss function according to the first picture characteristic and the second picture characteristic, wherein the smaller the distillation loss function is, the closer the first convolutional neural network and the second convolutional neural network are in the characteristic calculation process is; solving a preset classification loss function according to the first image classification result and the second image classification result, wherein the smaller the classification loss function is, the closer the first convolutional neural network and the second convolutional neural network are to a true value on the classification result is; the model optimization module 40 is configured to: and optimizing the first convolutional neural network and the second convolutional neural network by continuously optimizing the distillation 
loss function and the classification loss function, and finishing training when the distillation loss function is smaller than a preset first threshold and the classification loss function is smaller than a preset second threshold, so as to obtain a trained image recognition model constructed by the first convolutional neural network and the second convolutional neural network.
According to the embodiment of the present invention, the image matrix of the original picture and the image matrix of the scrambled picture are respectively input into two convolutional neural network branches, and the features extracted by the two networks and their classification results are combined for learning and training. This facilitates the capture and extraction of local features and yields more effective features; the same accuracy as strongly supervised fine-grained recognition can be achieved without any manual annotation information, the time and space consumption of the algorithm is reduced at the model level, and system robustness is improved.
Fig. 5 is a schematic structural diagram of an image recognition apparatus according to an embodiment of the present invention. As shown in fig. 5, the apparatus includes an image processing module 100, an output vector obtaining module 200, and an image recognition module 300, wherein: the image processing module 100 is configured to: after a first image matrix of an input picture is recorded, the input picture is segmented and disordered, and a second image matrix of the disordered input picture is obtained; the output vector obtaining module 200 is configured to: inputting the first image matrix into the first convolution neural network, and acquiring a first output vector of a full connection layer through the first convolution neural network; inputting the second image matrix into the second convolutional neural network, and acquiring a second output vector of the full-connection layer through the second convolutional neural network; the image recognition module 300 is configured to: and obtaining a picture identification result according to the first output vector and the second output vector.
The embodiment of the present invention can realize weakly supervised fine-grained image recognition using the image recognition model obtained by the above training method, requires no manual annotation information, and can achieve the same accuracy as strongly supervised fine-grained recognition.
The device provided by the embodiment of the present invention is used for the method, and specific functions may refer to the above method flow, which is not described herein again.
Fig. 6 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 6: a processor (processor)610, a communication Interface (Communications Interface)620, a memory (memory)630 and a communication bus 640, wherein the processor 610, the communication Interface 620 and the memory 630 communicate with each other via the communication bus 640. The processor 610 may invoke logic instructions in the memory 630 to perform a method of training an image recognition model, the method comprising: after a first image matrix of a sample picture is recorded, the sample picture is segmented and disordered, and a second image matrix of the disordered sample picture is obtained; inputting the first image matrix into a first convolution neural network, extracting first picture features through the first convolution neural network and obtaining a first picture classification result; inputting the second image matrix into a second convolutional neural network, extracting second image characteristics through the second convolutional neural network and obtaining a second image classification result; solving a preset distillation loss function according to the first picture characteristic and the second picture characteristic, wherein the smaller the distillation loss function is, the closer the first convolutional neural network and the second convolutional neural network are in the characteristic calculation process is; solving a preset classification loss function according to the first image classification result and the second image classification result, wherein the smaller the classification loss function is, the closer the first convolutional neural network and the second convolutional neural network are to a true value on the classification result is; and optimizing the first convolutional neural network and the second convolutional neural network by continuously optimizing the distillation loss function and the classification loss 
function, and finishing training when the distillation loss function is smaller than a preset first threshold and the classification loss function is smaller than a preset second threshold, so as to obtain a trained image recognition model constructed by the first convolutional neural network and the second convolutional neural network. Alternatively, the processor 610 may invoke logic instructions in the memory 630 to perform an image recognition method comprising: after a first image matrix of an input picture is recorded, the input picture is segmented and disordered, and a second image matrix of the disordered input picture is obtained; inputting the first image matrix into the first convolution neural network, and acquiring a first output vector of a full connection layer through the first convolution neural network; inputting the second image matrix into the second convolutional neural network, and acquiring a second output vector of the full-connection layer through the second convolutional neural network; and obtaining a picture identification result according to the first output vector and the second output vector.
In addition, the logic instructions in the memory 630 may be implemented in software functional units and stored in a computer readable storage medium when the logic instructions are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, an embodiment of the present invention further provides a computer program product, where the computer program product includes a computer program stored on a non-transitory computer-readable storage medium, the computer program includes program instructions, and when the program instructions are executed by a computer, the computer can perform the method for training an image recognition model provided by the above-mentioned embodiments of the method, where the method includes: after a first image matrix of a sample picture is recorded, the sample picture is segmented and disordered, and a second image matrix of the disordered sample picture is obtained; inputting the first image matrix into a first convolution neural network, extracting first picture features through the first convolution neural network and obtaining a first picture classification result; inputting the second image matrix into a second convolutional neural network, extracting second image characteristics through the second convolutional neural network and obtaining a second image classification result; solving a preset distillation loss function according to the first picture characteristic and the second picture characteristic, wherein the smaller the distillation loss function is, the closer the first convolutional neural network and the second convolutional neural network are in the characteristic calculation process is; solving a preset classification loss function according to the first image classification result and the second image classification result, wherein the smaller the classification loss function is, the closer the first convolutional neural network and the second convolutional neural network are to a true value on the classification result is; and optimizing the first convolutional neural network and the second convolutional neural network by continuously optimizing the distillation loss function and the classification loss function, and finishing training 
when the distillation loss function is smaller than a preset first threshold and the classification loss function is smaller than a preset second threshold, so as to obtain a trained image recognition model constructed by the first convolutional neural network and the second convolutional neural network. Or, when the program instructions are executed by a computer, the computer can execute the image recognition method provided by the above method embodiments, and the method comprises: after a first image matrix of an input picture is recorded, the input picture is segmented and disordered, and a second image matrix of the disordered input picture is obtained; inputting the first image matrix into the first convolution neural network, and acquiring a first output vector of a full connection layer through the first convolution neural network; inputting the second image matrix into the second convolutional neural network, and acquiring a second output vector of the full-connection layer through the second convolutional neural network; and obtaining a picture identification result according to the first output vector and the second output vector.
In yet another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program performs the method for training an image recognition model provided in the foregoing embodiments, the method comprising: recording a first image matrix of a sample picture, then segmenting and scrambling the sample picture to obtain a second image matrix of the scrambled sample picture; inputting the first image matrix into a first convolutional neural network, extracting first picture features through the first convolutional neural network, and obtaining a first picture classification result; inputting the second image matrix into a second convolutional neural network, extracting second picture features through the second convolutional neural network, and obtaining a second picture classification result; solving a preset distillation loss function according to the first picture features and the second picture features, wherein a smaller distillation loss function indicates that the feature calculation processes of the first convolutional neural network and the second convolutional neural network are closer; solving a preset classification loss function according to the first picture classification result and the second picture classification result, wherein a smaller classification loss function indicates that the classification results of the first convolutional neural network and the second convolutional neural network are closer to the true values; and optimizing the first convolutional neural network and the second convolutional neural network by continuously minimizing the distillation loss function and the classification loss function, and finishing training when the distillation loss function is smaller than a preset first threshold and the classification loss function is smaller than a preset second threshold, so as to obtain a trained image recognition model constructed from the first convolutional neural network and the second convolutional neural network. Alternatively, when executed by a processor, the computer program performs the image recognition method provided by the above embodiments, the method comprising: recording a first image matrix of an input picture, then segmenting and scrambling the input picture to obtain a second image matrix of the scrambled input picture; inputting the first image matrix into the first convolutional neural network, and obtaining a first output vector of a fully connected layer through the first convolutional neural network; inputting the second image matrix into the second convolutional neural network, and obtaining a second output vector of the fully connected layer through the second convolutional neural network; and obtaining a picture recognition result according to the first output vector and the second output vector.
The above-described apparatus embodiments are merely illustrative; units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units: they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement this without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general-purpose hardware platform, or certainly by hardware alone. Based on this understanding, the above technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as a ROM/RAM, magnetic disk or optical disc, and which includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments or in parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A training method of an image recognition model is characterized by comprising the following steps:
recording a first image matrix of a sample picture, then segmenting and scrambling the sample picture to obtain a second image matrix of the scrambled sample picture;
inputting the first image matrix into a first convolutional neural network, extracting first picture features through the first convolutional neural network and obtaining a first picture classification result; inputting the second image matrix into a second convolutional neural network, extracting second picture features through the second convolutional neural network and obtaining a second picture classification result;
solving a preset distillation loss function according to the first picture features and the second picture features, wherein a smaller distillation loss function indicates that the feature calculation processes of the first convolutional neural network and the second convolutional neural network are closer; solving a preset classification loss function according to the first picture classification result and the second picture classification result, wherein a smaller classification loss function indicates that the classification results of the first convolutional neural network and the second convolutional neural network are closer to the true values; and
optimizing the first convolutional neural network and the second convolutional neural network by continuously minimizing the distillation loss function and the classification loss function, and finishing training when the distillation loss function is smaller than a preset first threshold and the classification loss function is smaller than a preset second threshold, so as to obtain a trained image recognition model constructed from the first convolutional neural network and the second convolutional neural network.
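The training loop of claim 1 can be illustrated with a deliberately simplified sketch. The two convolutional networks are replaced here by toy linear models, the "classification result" is taken to be the first three feature dimensions, and all sizes, thresholds and the learning rate are arbitrary illustrative assumptions; the patent does not specify any of them.

```python
import numpy as np

# Toy sketch of the claim-1 loop: two stand-in "networks" (single linear
# maps), a feature-matching distillation loss, a cross-entropy
# classification loss, and the dual-threshold stopping rule.
rng = np.random.default_rng(0)
W_g = rng.normal(size=(8, 3)) * 0.1   # stand-in for the first (global) CNN
W_l = rng.normal(size=(8, 3)) * 0.1   # stand-in for the second (local) CNN

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

x = rng.normal(size=3)                # one "picture"
x /= np.linalg.norm(x)                # normalised for a stable step size
label = 1                             # its true class (3 toy classes)
tau1, tau2 = 1e-3, 0.5                # preset first / second thresholds
lr = 0.05
one_hot = np.eye(3)[label]

for step in range(20000):
    f_g, f_l = W_g @ x, W_l @ x                    # "picture features"
    p_g, p_l = softmax(f_g[:3]), softmax(f_l[:3])  # "classification results"
    distill = np.sum((f_g - f_l) ** 2)             # distillation loss
    classify = -np.log(p_g[label]) - np.log(p_l[label])  # classification loss
    if distill < tau1 and classify < tau2:         # claim-1 stopping rule
        break
    # gradient step on both losses for both networks
    g_Wg = 2 * np.outer(f_g - f_l, x)
    g_Wl = -2 * np.outer(f_g - f_l, x)
    g_Wg[:3] += np.outer(p_g - one_hot, x)
    g_Wl[:3] += np.outer(p_l - one_hot, x)
    W_g -= lr * g_Wg
    W_l -= lr * g_Wl
```

With real networks the same structure holds: compute both losses, update both branches jointly, and stop once both losses fall below their preset thresholds.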
2. The method for training an image recognition model according to claim 1, wherein segmenting and scrambling the sample picture specifically comprises:
first dividing the picture into a plurality of image blocks; then performing the operation of scrambling the image blocks in the row direction, followed by the operation of scrambling the image blocks in the column direction; or, first performing the operation of scrambling the image blocks in the column direction, followed by the operation of scrambling the image blocks in the row direction.
3. The method for training an image recognition model according to claim 2, wherein performing the operation of scrambling the image blocks in the row direction comprises: for each image block in each row, exchanging positions with the image block at the corresponding position in the row direction, within a preset first step-length range, according to the value of a first random variable;
and performing the operation of scrambling the image blocks in the column direction comprises: for each image block in each column, exchanging positions with the image block at the corresponding position in the column direction, within a preset second step-length range, according to the value of a second random variable.
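A minimal NumPy sketch of the neighbourhood-constrained block shuffle of claims 2 and 3, assuming a square n-by-n block grid and implementing the bounded random exchange by jittering each block index with an offset drawn from [-k, k] and re-sorting (a common realisation of this kind of constraint; the grid size n and step length k are illustrative assumptions):

```python
import numpy as np

def shuffle_blocks(img, n, k, rng):
    """Split img into an n-by-n grid of blocks, scramble the blocks in the
    row direction, then in the column direction.  Each block's index is
    jittered by a random offset in [-k, k] and the blocks are re-sorted,
    so a block only moves within a bounded neighbourhood (the 'preset
    step-length range')."""
    h, w = img.shape[:2]
    bh, bw = h // n, w // n
    blocks = [[img[i*bh:(i+1)*bh, j*bw:(j+1)*bw] for j in range(n)]
              for i in range(n)]
    for i in range(n):                 # scramble within each row
        order = np.argsort(np.arange(n) + rng.uniform(-k, k, size=n))
        blocks[i] = [blocks[i][j] for j in order]
    for j in range(n):                 # then scramble within each column
        order = np.argsort(np.arange(n) + rng.uniform(-k, k, size=n))
        col = [blocks[i][j] for i in range(n)]
        for i in range(n):
            blocks[i][j] = col[order[i]]
    return np.concatenate([np.concatenate(row, axis=1) for row in blocks],
                          axis=0)

img = np.arange(36, dtype=float).reshape(6, 6)
out = shuffle_blocks(img, n=3, k=1, rng=np.random.default_rng(0))
```

The scrambled picture keeps exactly the same pixels as the original, only their block positions change, which is what allows the second branch to learn local detail from the same image content.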
4. The method for training an image recognition model according to claim 1, wherein solving a preset distillation loss function according to the first picture features and the second picture features comprises:
acquiring a global flow matrix according to the first picture features extracted from two adjacent convolutional layers in the first convolutional neural network, and acquiring a local flow matrix according to the second picture features extracted from two adjacent convolutional layers in the second convolutional neural network;
solving the preset distillation loss function by calculating the L2-norm distance between the global flow matrix and the local flow matrix.
5. The method for training an image recognition model according to claim 4, wherein the global flow matrix and the local flow matrix obtained from the picture features of two adjacent layers are expressed as:

$$G_{s,t}(x; W) = \sum_{i=1}^{h} \sum_{j=1}^{w} \frac{F^{1}_{i,j,s}(x; W) \times F^{2}_{i,j,t}(x; W)}{h \times w}$$

wherein $F^{1} \in R^{h \times w \times m}$ represents the picture features of the upper layer c1 of the two adjacent layers, $F^{2} \in R^{h \times w \times m}$ represents the picture features of the lower layer c2 of the two adjacent layers; h, w and m respectively represent the height, width and number of channels of the picture features; s and t respectively represent channel indices of $F^{1}$ and $F^{2}$; x represents the input picture; and W represents the weight parameters of the neural network.
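A short NumPy sketch of the flow-matrix computation of claim 5; the einsum performs the sum over the spatial positions i, j for every channel pair (s, t), as in the formula above (array sizes are arbitrary illustrative choices):

```python
import numpy as np

def flow_matrix(F1, F2):
    """G[s, t] = sum_{i,j} F1[i, j, s] * F2[i, j, t] / (h * w),
    where F1 and F2 are the (h, w, m) feature maps of two adjacent
    convolutional layers."""
    h, w = F1.shape[:2]
    return np.einsum('ijs,ijt->st', F1, F2) / (h * w)

rng = np.random.default_rng(0)
F1 = rng.normal(size=(4, 4, 5))
F2 = rng.normal(size=(4, 4, 5))
G = flow_matrix(F1, F2)     # one (5, 5) flow matrix per adjacent layer pair
```

Each entry of G is the normalised inner product of one channel map of the upper layer with one channel map of the lower layer, so G summarises how information "flows" between the two layers.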
6. The method for training an image recognition model according to claim 5, wherein the distillation loss function is expressed as:

$$L_{flow}(W_{global}, W_{local}) = \lambda_{1} \cdot \frac{1}{N} \sum_{x} \sum_{l=1}^{n} \left\| G_{l}^{global}(x) - G_{l}^{local}(x) \right\|_{2}^{2}$$

wherein $W_{global}$ represents the global flow matrix and $W_{local}$ represents the local flow matrix; $L_{flow}(W_{global}, W_{local})$ represents the distillation loss function derived from the global flow matrix and the local flow matrix; $\lambda_{1}$ represents a weight coefficient; l represents the index of a flow matrix, the flow matrices comprising the global flow matrices and the local flow matrices; n represents the number of flow matrices for one picture, the number of global flow matrices being the same as the number of local flow matrices; x represents an input picture; N represents the number of pictures; $G_{l}^{global}(x)$ represents the l-th global flow matrix of picture x; $G_{l}^{local}(x)$ represents the l-th local flow matrix of picture x; and $\|\cdot\|_{2}^{2}$ represents the squared L2-norm distance calculation.
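The distillation loss of claim 6 reduces to matching each global flow matrix with its local counterpart and averaging over pictures; a minimal NumPy sketch (the weight λ1 = 1 and the toy matrices are illustrative assumptions):

```python
import numpy as np

def distillation_loss(global_mats, local_mats, lam1=1.0):
    """L_flow = lam1 * (1/N) * sum over pictures x of
    sum_{l=1}^{n} ||G_l^global(x) - G_l^local(x)||_2^2,
    where global_mats / local_mats hold one list of flow matrices
    per picture."""
    N = len(global_mats)
    total = 0.0
    for g_list, l_list in zip(global_mats, local_mats):
        total += sum(np.sum((g - l) ** 2) for g, l in zip(g_list, l_list))
    return lam1 * total / N

# one picture, one pair of 2x2 flow matrices
loss = distillation_loss([[np.ones((2, 2))]], [[np.zeros((2, 2))]])
```

Driving this value down forces the local (scrambled-input) branch to reproduce the layer-to-layer information flow of the global branch.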
7. An image recognition method based on an image recognition model trained by the method of any one of claims 1 to 6, comprising:
recording a first image matrix of an input picture, then segmenting and scrambling the input picture to obtain a second image matrix of the scrambled input picture;
inputting the first image matrix into the first convolutional neural network, and obtaining a first output vector of a fully connected layer through the first convolutional neural network; inputting the second image matrix into the second convolutional neural network, and obtaining a second output vector of the fully connected layer through the second convolutional neural network; and
obtaining a picture recognition result according to the first output vector and the second output vector.
8. The image recognition method according to claim 7, wherein obtaining the picture recognition result according to the first output vector and the second output vector comprises:
adding the first output vector and the second output vector to obtain a third output vector, and obtaining the picture recognition result according to the third output vector.
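The fusion step of claim 8 is a plain element-wise sum of the two fully connected output vectors. A minimal sketch; taking the arg-max of the third vector as the final class is an assumption for illustration, since the claim only states that the result is obtained from that vector:

```python
import numpy as np

def recognize(v_global, v_local):
    """Add the two branch output vectors into a third vector and take its
    arg-max as the picture recognition result (arg-max is assumed)."""
    v3 = np.asarray(v_global) + np.asarray(v_local)
    return int(np.argmax(v3)), v3

cls, fused = recognize([0.2, 1.5, 0.3], [0.4, 0.9, 0.1])
```

Summing the logits lets evidence from the intact picture and the scrambled picture reinforce each other before the decision is taken.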
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method for training an image recognition model according to any one of claims 1 to 6 or the steps of the image recognition method according to any one of claims 7 to 8.
10. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the training method of the image recognition model according to any one of claims 1 to 6 or the steps of the image recognition method according to any one of claims 7 to 8.
CN202010772704.6A 2020-08-04 2020-08-04 Training method of image recognition model and image recognition method Pending CN112016591A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010772704.6A CN112016591A (en) 2020-08-04 2020-08-04 Training method of image recognition model and image recognition method
PCT/CN2021/084760 WO2022027987A1 (en) 2020-08-04 2021-03-31 Image recognition model training method, and image recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010772704.6A CN112016591A (en) 2020-08-04 2020-08-04 Training method of image recognition model and image recognition method

Publications (1)

Publication Number Publication Date
CN112016591A (en) 2020-12-01

Family

ID=73498469

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010772704.6A Pending CN112016591A (en) 2020-08-04 2020-08-04 Training method of image recognition model and image recognition method

Country Status (2)

Country Link
CN (1) CN112016591A (en)
WO (1) WO2022027987A1 (en)


Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114900779B (en) * 2022-04-12 2023-06-06 东莞市晨新电子科技有限公司 Audio compensation method, system and electronic equipment
CN114979470A (en) * 2022-05-12 2022-08-30 咪咕文化科技有限公司 Camera rotation angle analysis method, device, equipment and storage medium
CN115061427B (en) * 2022-06-28 2023-04-14 浙江同发塑机有限公司 Material layer uniformity control system of blow molding machine and control method thereof
CN115356434B (en) * 2022-07-14 2023-06-02 福建省杭氟电子材料有限公司 Gas monitoring system and method for hexafluorobutadiene storage place
CN116245832B (en) * 2023-01-30 2023-11-14 浙江医准智能科技有限公司 Image processing method, device, equipment and storage medium
CN116544146B (en) * 2023-05-22 2024-04-09 浙江固驰电子有限公司 Vacuum sintering equipment and method for power semiconductor device
CN116563795A (en) * 2023-05-30 2023-08-08 北京天翊文化传媒有限公司 Doll production management method and doll production management system
CN116469132B (en) * 2023-06-20 2023-09-05 济南瑞泉电子有限公司 Fall detection method, system, equipment and medium based on double-flow feature extraction
CN117274903B (en) * 2023-09-25 2024-04-19 安徽南瑞继远电网技术有限公司 Intelligent early warning device and method for electric power inspection based on intelligent AI chip
CN117690007B (en) * 2024-02-01 2024-04-19 成都大学 High-frequency workpiece image recognition method
CN117853875B (en) * 2024-03-04 2024-05-14 华东交通大学 Fine-granularity image recognition method and system

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108537277A (en) * 2018-04-10 2018-09-14 湖北工业大学 A kind of image classification knowledge method for distinguishing
CN108776807A (en) * 2018-05-18 2018-11-09 复旦大学 It is a kind of based on can the double branch neural networks of skip floor image thickness grain-size classification method
CN109948425A (en) * 2019-01-22 2019-06-28 中国矿业大学 A kind of perception of structure is from paying attention to and online example polymerize matched pedestrian's searching method and device
CN109977980A (en) * 2017-12-28 2019-07-05 航天信息股份有限公司 A kind of method for recognizing verification code and device
CN110084281A (en) * 2019-03-31 2019-08-02 华为技术有限公司 Image generating method, the compression method of neural network and relevant apparatus, equipment
CN110674938A (en) * 2019-08-21 2020-01-10 浙江工业大学 Anti-attack defense method based on cooperative multi-task training
CN110717525A (en) * 2019-09-20 2020-01-21 浙江工业大学 Channel adaptive optimization anti-attack defense method and device
CN110930356A (en) * 2019-10-12 2020-03-27 上海交通大学 Industrial two-dimensional code reference-free quality evaluation system and method
CN111160275A (en) * 2019-12-30 2020-05-15 深圳元戎启行科技有限公司 Pedestrian re-recognition model training method and device, computer equipment and storage medium
CN111260055A (en) * 2020-01-13 2020-06-09 腾讯科技(深圳)有限公司 Model training method based on three-dimensional image recognition, storage medium and equipment
CN111353539A (en) * 2020-02-29 2020-06-30 武汉大学 Cervical OCT image classification method and system based on double-path attention convolutional neural network

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10776662B2 (en) * 2017-11-09 2020-09-15 Disney Enterprises, Inc. Weakly-supervised spatial context networks to recognize features within an image
CN108596026B (en) * 2018-03-16 2020-06-30 中国科学院自动化研究所 Cross-view gait recognition device and training method based on double-flow generation countermeasure network
CN109492765A (en) * 2018-11-01 2019-03-19 浙江工业大学 Image incremental learning algorithm based on transfer models
CN110751214A (en) * 2019-10-21 2020-02-04 山东大学 Target detection method and system based on lightweight deformable convolution
CN111415318B (en) * 2020-03-20 2023-06-13 山东大学 Unsupervised related filtering target tracking method and system based on jigsaw task
CN112016591A (en) * 2020-08-04 2020-12-01 杰创智能科技股份有限公司 Training method of image recognition model and image recognition method


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHAO CHAOYANG ET AL.: "Real-time Multi-Scale Face Detector on Embedded Devices", Sensors, 1 May 2019 (2019-05-01), pages 1-22 *
GUAN WENJIE: "Fine-Grained Classification and Detection of Objects Based on Attention Mechanism and Knowledge Distillation" (in Chinese), Information Science and Technology Series (《信息科技辑》), 15 July 2019 (2019-07-15), pages 1-5 *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022027987A1 (en) * 2020-08-04 2022-02-10 杰创智能科技股份有限公司 Image recognition model training method, and image recognition method
CN112966709A (en) * 2021-01-27 2021-06-15 中国电子进出口有限公司 Deep learning-based fine vehicle type identification method and system
CN112966709B (en) * 2021-01-27 2022-09-23 中国电子进出口有限公司 Deep learning-based fine vehicle type identification method and system
CN112862095A (en) * 2021-02-02 2021-05-28 浙江大华技术股份有限公司 Self-distillation learning method and device based on characteristic analysis and readable storage medium
CN112862095B (en) * 2021-02-02 2023-09-29 浙江大华技术股份有限公司 Self-distillation learning method and device based on feature analysis and readable storage medium
CN113052772A (en) * 2021-03-23 2021-06-29 Oppo广东移动通信有限公司 Image processing method, image processing device, electronic equipment and storage medium
CN113011387A (en) * 2021-04-20 2021-06-22 上海商汤科技开发有限公司 Network training and human face living body detection method, device, equipment and storage medium
CN113011387B (en) * 2021-04-20 2024-05-24 上海商汤科技开发有限公司 Network training and human face living body detection method, device, equipment and storage medium
CN113191426A (en) * 2021-04-28 2021-07-30 深圳市捷顺科技实业股份有限公司 Vehicle identification model creation method, vehicle identification method and related components
CN113269117A (en) * 2021-06-04 2021-08-17 重庆大学 Knowledge distillation-based pedestrian re-identification method
CN113627421A (en) * 2021-06-30 2021-11-09 华为技术有限公司 Image processing method, model training method and related equipment
CN113706642A (en) * 2021-08-31 2021-11-26 北京三快在线科技有限公司 Image processing method and device
CN114118379A (en) * 2021-12-02 2022-03-01 北京百度网讯科技有限公司 Neural network training method, image processing method, device, equipment and medium
CN114299349A (en) * 2022-03-04 2022-04-08 南京航空航天大学 Crowd-sourced image learning method based on multi-expert system and knowledge distillation
CN114299349B (en) * 2022-03-04 2022-05-13 南京航空航天大学 Crowdsourcing image learning method based on multi-expert system and knowledge distillation
CN114817742A (en) * 2022-05-18 2022-07-29 平安科技(深圳)有限公司 Knowledge distillation-based recommendation model configuration method, device, equipment and medium
CN114821203A (en) * 2022-06-29 2022-07-29 中国科学院自动化研究所 Fine-grained image model training and identifying method and device based on consistency loss

Also Published As

Publication number Publication date
WO2022027987A1 (en) 2022-02-10

Similar Documents

Publication Publication Date Title
CN112016591A (en) Training method of image recognition model and image recognition method
Zhang et al. Animal detection from highly cluttered natural scenes using spatiotemporal object region proposals and patch verification
Salman et al. Fish species classification in unconstrained underwater environments based on deep learning
Spampinato et al. Automatic fish classification for underwater species behavior understanding
Bautista et al. Convolutional neural network for vehicle detection in low resolution traffic videos
Ahmad et al. Visual features based boosted classification of weeds for real-time selective herbicide sprayer systems
CN110674874B (en) Fine-grained image identification method based on target fine component detection
Altenberger et al. A non-technical survey on deep convolutional neural network architectures
Bianco et al. Predicting image aesthetics with deep learning
CN110909618B (en) Method and device for identifying identity of pet
Ahmed et al. Automated weed classification with local pattern-based texture descriptors.
CN107633226A (en) A kind of human action Tracking Recognition method and system
CN110046574A (en) Safety cap based on deep learning wears recognition methods and equipment
Kounalakis et al. A robotic system employing deep learning for visual recognition and detection of weeds in grasslands
CN113673607A (en) Method and device for training image annotation model and image annotation
Škrabánek et al. Detection of grapes in natural environment using support vector machine classifier
Anas et al. Detecting abnormal fish behavior using motion trajectories in ubiquitous environments
CN117437691A (en) Real-time multi-person abnormal behavior identification method and system based on lightweight network
Mezenner et al. Local Directional Patterns for Plant Leaf Disease Detection
CN115409938A (en) Three-dimensional model construction method, device, equipment and storage medium
Wang et al. Eigen-evolution dense trajectory descriptors
Goyal et al. Moving Object Detection in Video Streaming Using Improved DNN Algorithm
Dwiwijaya et al. Identification of Herbal Plants Using Morphology Method and K-Nearest Neighbour Algorithm (KNN)
Eghbali et al. Deep Convolutional Neural Network (CNN) for Large-Scale Images Classification
Lillywhite et al. Automated fish taxonomy using evolution-constructed features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination