CN116805387B - Model training method, quality inspection method and related equipment based on knowledge distillation


Info

Publication number: CN116805387B
Application number: CN202311071587.0A
Authority: CN (China)
Other versions: CN116805387A (Chinese)
Inventor: 何泽强 (He Zeqiang)
Assignee: Tencent Technology (Shenzhen) Co., Ltd.
Legal status: Active (granted)
Classification: Image Analysis

Abstract

The embodiment of the application provides a model training method, a quality inspection method and related equipment based on knowledge distillation, which can be applied to scenes such as cloud technology, artificial intelligence, intelligent traffic and assisted driving, as well as to model training, equipment quality inspection and the like. The model training method comprises the following steps: acquiring a first target feature and a second target feature, extracted from an input image by a teacher model and a student model respectively, each comprising content features and dense features; determining, through an attention mechanism, the weight matrices corresponding to the content features and the dense features of the first target feature; determining the corresponding feature loss values for the content features and the dense features based on the weight matrices; calculating a class loss value between the predicted class indicated by the dense features of the second target feature and the real class corresponding to the input image; and adjusting network parameters of the student model based on the class loss value and the feature loss values. The application can improve the performance of the student model.

Description

Model training method, quality inspection method and related equipment based on knowledge distillation
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a model training method, a quality inspection method and related equipment based on knowledge distillation.
Background
Deep learning models generally have a large number of parameters, which makes running the model time-consuming. Model compression refers to methods that accelerate a model while keeping its metrics essentially unchanged; common model compression techniques include model quantization, model pruning, model knowledge distillation and lightweight model search. Model knowledge distillation involves a teacher network and a student network: the output of the teacher network serves as knowledge to guide the training of the student network, so that the student network's metrics approach those of the teacher network.
However, in existing knowledge distillation techniques, channel attention and spatial attention must be computed manually to obtain the weight matrix. Because this computation depends on manual design, the cost is high, the effectiveness of the weight matrix cannot be guaranteed, and an optimal weight matrix is difficult to obtain; as a result, the student network cannot learn the teacher network's output well and with emphasis, and the trained student network performs poorly.
Disclosure of Invention
The embodiment of the application provides a model training method, a quality inspection method and related equipment based on knowledge distillation for solving at least one technical problem. The technical scheme is as follows:
In a first aspect, an embodiment of the present application provides a knowledge distillation-based model training method, including:
acquiring a first target feature extracted from an input image based on a teacher model and a second target feature extracted from the input image based on a student model; the first target feature and the second target feature comprise content features for expressing content and dense features for predicting categories;
determining a characteristic weight matrix corresponding to the content characteristics of the first target characteristics and the dense characteristics of the first target characteristics respectively through an attention mechanism;
the following operations are respectively carried out for the content features and the dense features to determine the respective corresponding feature loss values: calculating a loss value between the extracted features of the teacher model and the student model, and determining a feature loss value based on the corresponding feature weight matrix and the loss value;
calculating a class loss value between a predicted class indicated by the class feature of the second target feature and a real class corresponding to the input image;
and adjusting network parameters of the student model based on the category loss value and the characteristic loss value.
In a possible embodiment, the determining, by an attention mechanism, a feature weight matrix corresponding to the content feature of the first target feature and the dense feature of the first target feature, respectively, includes performing the following attention weight calculation operations for the content feature of the first target feature and the dense feature of the first target feature, respectively:
Setting a learnable matrix, and determining a key matrix and a value matrix based on a first target feature extracted by the teacher model;
repeatedly executing an assignment operation until a preset stopping condition is met, and determining the finally output learnable matrix as the corresponding feature weight matrix, wherein element values in the feature weight matrix represent the importance degrees of elements on the feature map corresponding to the first target feature;
wherein the assigning operation includes: and performing attention calculation based on a preset calculation rule based on the learnable matrix, the key matrix and the value matrix.
In a possible embodiment, the performing the attention calculation based on the learnable matrix, the key matrix, and the value matrix based on a preset calculation rule includes:
multiplying the transpose of the key matrix by the learnable matrix to obtain a first matrix;
multiplying the first matrix by a scaling factor to obtain a second matrix;
multiplying the second matrix by a mask matrix to obtain a third matrix; each element value in the mask matrix is a preset value;
performing probability distribution conversion operation on the third matrix to obtain a fourth matrix indicating the weight;
Multiplying the fourth matrix by the matrix of values and assigning a multiplication result to the learnable matrix.
In a possible embodiment, the determining the respective corresponding feature loss value for the content feature and the dense feature respectively performs the following operations: calculating a loss value between the extracted features of the teacher model and the student model, and determining a feature loss value based on the corresponding feature weight matrix and the loss value, including:
performing convolution processing on the content features of the second target features to enable the dimensionality of the content features of the first target features to be consistent with that of the content features of the second target features, calculating a square difference loss value between the content features of the first target features and the content features of the convolved second target features element by element, and multiplying the loss value by a feature weight matrix corresponding to the content features to obtain a first feature loss value;
and carrying out convolution processing on the dense features of the second target features so as to enable the dimensions of the dense features of the first target features and the dense features of the second target features to be consistent, calculating a square difference loss value between the dense features of the first target features and the dense features of the convolved second target features element by element, and multiplying the loss value by a feature weight matrix corresponding to the dense features to obtain a second feature loss value.
In a possible embodiment, the class features include dense features and sparse features; the calculating a class loss value between a predicted class indicated by the class feature of the second target feature and a real class corresponding to the input image includes:
acquiring sparse features of the second target feature determined based on dense features of the second target feature;
and calculating class loss values between the real class corresponding to the input image and, respectively, the predicted class indicated by the dense features of the second target feature and the predicted class indicated by the sparse features of the second target feature.
In a possible embodiment, the determining sparse features of the second target feature based on dense features of the second target feature includes:
predicting a candidate target frame in the input image based on dense features of the second target feature; the candidate target box indicates a location in the input image where a target object is present, predicted based on dense features of the second target feature;
and cutting the dense features of the second target features based on the coordinate information of the candidate target frame in the input image to obtain sparse features of the second target features.
In a possible embodiment, calculating a first class loss value between a predicted class indicated by the dense feature of the second target feature and a true class corresponding to the input image includes:
calculating a first dense loss value based on the dense features of the second target features and the error between the real target frame corresponding to the input image and the prior target frame;
predicting a prediction category to which a candidate target frame in the input image belongs based on the dense features of the second target features, and calculating a second dense loss value between the prediction category and a real category to which a real target frame corresponding to the input image belongs; the candidate target box indicates a location in the input image where a target object is present, predicted based on dense features of the second target feature;
a first class loss value between a predicted class indicated by the dense feature of the second target feature and a true class corresponding to the input image is determined based on the first dense loss value and the second dense loss value.
In a possible embodiment, the calculating a first dense loss value based on the dense features of the second target feature and the error between the real target frame and the prior target frame corresponding to the input image includes:
Acquiring a target error between a real target frame and a priori target frame corresponding to the input image;
predicting a first prediction error between the real target frame and an a priori target frame based on dense features of the second target features;
and calculating a first dense loss value between the target error and the first prediction error based on a preset intersection-over-union (IoU) loss function.
In a possible embodiment, calculating a second class loss value between a predicted class indicated by the sparse feature of the second target feature and a true class corresponding to the input image includes:
calculating a first sparse loss value based on the sparse features of the second target features and errors between a real target frame and a priori target frame corresponding to the input image;
predicting a prediction category to which a candidate target frame in the input image belongs based on the sparse feature of the second target feature, and calculating a second sparse loss value between the prediction category and a real category to which a real target frame corresponding to the input image belongs; the candidate target box indicates a location in the input image where a target object is present, predicted based on sparse features of the second target feature;
And determining a second class loss value between a predicted class indicated by the sparse feature of the second target feature and a real class corresponding to the input image based on the first sparse loss value and the second sparse loss value.
In a possible embodiment, the calculating a first sparse loss value based on the sparse features of the second target feature and the error between the real target frame corresponding to the input image and the prior target frame includes:
predicting a second prediction error between a real target frame corresponding to the input image and the candidate target frame based on the sparse feature of the second target feature;
calculating a first sparse loss value between a target error and the second prediction error based on a preset intersection-over-union (IoU) loss function; the target error is the error between the real target frame and the prior target frame corresponding to the input image.
In a possible embodiment, the method provided by the embodiment of the application further includes:
acquiring sparse features of the first target feature determined based on dense features of the first target feature;
performing logit distillation based on the sparse features of the first target feature and the sparse features of the second target feature to obtain a logit loss value;
The adjusting the network parameters of the student model based on the category loss value and the feature loss value includes: adjusting network parameters of the student model based on the sum of the category loss value, the feature loss value and the logit loss value.
In a possible embodiment, the teacher model and the student model each include a backbone module, a fusion module, a dense prediction module, and a sparse prediction module;
the content features comprise backbone features extracted from the input image by the backbone module, and fusion features determined by the fusion module based on the backbone features;
the dense features include features determined by the dense prediction module based on the fused features of the input;
the sparse prediction module is connected with the dense prediction module and is used for outputting sparse features based on the input dense features so as to determine the category loss value based on the dense features and the sparse features.
In a second aspect, an embodiment of the present application provides a quality inspection method, including:
identifying equipment images through a pre-trained quality inspection model, and determining whether defects exist in the equipment images and the defect types corresponding to the defects;
The quality inspection model is a student model trained by the knowledge distillation-based model training method in the first aspect.
In a possible embodiment, the quality inspection model comprises a first quality inspection model for detecting side defects of the equipment shell and a second quality inspection model for detecting large-surface defects of the equipment shell; the equipment image comprises a first image obtained by shooting the side surface of the equipment shell and a second image obtained by shooting the large surface of the equipment shell; the defect class is associated with a class of the device and includes a plurality of substantial defect classes and a plurality of non-substantial defect classes.
In a third aspect, an embodiment of the present application provides a knowledge distillation-based model training apparatus, including:
the feature acquisition module is used for acquiring a first target feature extracted from an input image based on a teacher model and a second target feature extracted from the input image based on a student model; the first target feature and the second target feature comprise content features for expressing content and dense features for predicting categories;
the weight determining module is used for determining a characteristic weight matrix corresponding to the content characteristic of the first target characteristic and the dense characteristic of the first target characteristic respectively through an attention mechanism;
A feature loss determination module, configured to determine respective corresponding feature loss values for the content feature and the dense feature by performing the following operations: calculating a loss value between the extracted features of the teacher model and the student model, and determining a feature loss value based on the corresponding feature weight matrix and the loss value;
the category loss determining module is used for calculating a category loss value between a predicted category indicated by the category characteristic of the second target characteristic and a real category corresponding to the input image;
and the parameter adjustment module is used for adjusting the network parameters of the student model based on the category loss value and the characteristic loss value.
In a fourth aspect, an embodiment of the present application provides a quality inspection apparatus, including:
the image recognition module is used for recognizing the equipment image through the pre-trained quality inspection model and determining whether defects exist in the equipment image and the defect category corresponding to the defects;
the quality inspection model is a student model trained by the knowledge distillation-based model training method in the first aspect.
In a fifth aspect, an embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory, where the processor executes the computer program to implement the steps of the method provided in the first or second aspect.
In a sixth aspect, embodiments of the present application provide a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the steps of the method provided in the first or second aspect described above.
In a seventh aspect, embodiments of the present application provide a computer program product comprising a computer program which, when executed by a processor, implements the steps of the method provided in the first or second aspect described above.
The technical scheme provided by the embodiment of the application has the beneficial effects that:
In a first aspect, an embodiment of the present application provides a knowledge distillation based model training method. Specifically, for an input image, a first target feature extracted by a teacher model and a second target feature extracted by a student model are obtained first; since the teacher model and the student model differ in network parameters but share the same data flow, the first target feature and the second target feature each include a content feature for expressing content and a dense feature for predicting a category. On this basis, the embodiment of the application determines, through an attention mechanism, a feature weight matrix for the content features of the first target feature and a feature weight matrix for the dense features of the first target feature; that is, the attention weight matrices are calculated for the features output by the teacher model. Because this calculation does not depend on manual design, the cost of computing the weight matrices is effectively reduced, the effectiveness of the obtained weight matrices is ensured by the attention mechanism, and an optimal weight matrix can be obtained. Then, feature loss values between teacher model features and student model features are calculated for the content features and the dense features based on the corresponding feature weight matrices, so that errors between teacher features and student features receive different weights; errors given large weights can dominate network optimization, and the attention weight matrices enable the student model to imitate the teacher model's output better and more effectively, improving the student model's performance. In addition, the embodiment of the application calculates the class loss value between the class predicted from the dense features of the student model and the real class corresponding to the input image, so that both the feature loss value and the class loss value are considered when adjusting the network parameters of the student model, bringing the student model's performance closer to the teacher model's and improving the performance of the trained student model.
In a second aspect, an embodiment of the present application provides a quality inspection method. Specifically, an equipment image is identified by a pre-trained quality inspection model, which predicts whether a defect exists in the equipment image and the defect class corresponding to the defect. The quality inspection model can be a student model trained with the model training method of the first aspect; on the basis of the improved performance and inspection speed of the trained student model, the accuracy of quality inspection can be improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings that are required to be used in the description of the embodiments of the present application will be briefly described below.
FIG. 1 is a flow chart of a knowledge distillation-based model training method provided by an embodiment of the application;
FIG. 2 is a flow chart of a quality inspection method according to an embodiment of the present application;
FIG. 3 is an overall frame diagram of a model provided in an embodiment of the present application;
FIG. 4 is a flow chart illustrating the operation of a model according to an embodiment of the present application;
FIG. 5 is a flow chart of the calculation of attention weights according to an embodiment of the present application;
FIG. 6 is a flowchart of a knowledge distillation algorithm according to an embodiment of the present application;
FIG. 7 is an image of a side of a device housing according to an embodiment of the present application;
FIG. 8 is a representation of a large surface of a device housing according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a knowledge-based model training apparatus according to an embodiment of the present application;
FIG. 10 is a schematic diagram of a quality inspection device according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the drawings in the present application. It should be understood that the embodiments described below with reference to the drawings are exemplary descriptions for explaining the technical solutions of the embodiments of the present application and do not limit those technical solutions.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise. It will be further understood that the terms "comprises" and "comprising," when used in this specification, specify the presence of stated features, information, data, steps, operations, elements, and/or components, but do not preclude the presence or addition of other features, information, data, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" indicates at least one of the items it defines; e.g., "A and/or B" may be implemented as "A", as "B", or as "A and B".
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
Embodiments of the present application relate to artificial intelligence (AI). Artificial intelligence is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, enabling machines to perceive, reason and make decisions. Artificial intelligence is a comprehensive discipline covering a wide range of fields, at both the hardware level and the software level. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
In particular, embodiments of the present application relate to machine learning and deep learning techniques.
Machine learning is a multi-field interdiscipline involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how a computer can simulate or implement human learning behaviors to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from instruction. Deep learning is a branch of machine learning: an algorithm that performs representation learning on data with artificial neural networks as the framework. It can simulate the human brain and perform perceptual analysis of the real world at different levels.
The embodiment of the application relates to a model knowledge distillation technology in deep learning.
In model knowledge distillation, there are two neural networks. One network performs well but has a large number of parameters and a slow running speed; the embodiment of the application calls it the teacher model. The other performs relatively poorly but has a small number of parameters and a fast running speed; the embodiment of the application calls it the student model. The knowledge distillation method uses the output of the teacher network as knowledge to guide the training of the student network, so that the student network's metrics approach those of the teacher network. Finally, the student network can achieve both good performance and high speed.
Aiming at at least one technical problem in the prior art, the embodiment of the application provides a model training scheme based on knowledge distillation in which a Transformer technique is adopted to construct the weight matrix. The weight matrix can thus learn from data of the application scene, yielding a better, scene-adapted weight matrix and helping to improve the efficiency of model training. The scheme calculates the error between teacher features and student features in combination with the weight matrix, so that the student model can imitate the teacher model's output with emphasis and more accurately, improving the performance of the student model obtained by knowledge-distillation training.
The technical solutions of the embodiments of the present application and technical effects produced by the technical solutions of the present application are described below by describing several exemplary embodiments. It should be noted that the following embodiments may be referred to, or combined with each other, and the description will not be repeated for the same terms, similar features, similar implementation steps, and the like in different embodiments.
The following describes a knowledge distillation-based model training method in an embodiment of the present application.
Specifically, the execution subject of the method provided by the embodiment of the application can be a terminal or a server; the terminal (may also be referred to as a device) may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, an intelligent voice interaction device (e.g., a smart speaker), a wearable electronic device (e.g., a smart watch), a vehicle-mounted terminal, a smart home appliance (e.g., a smart television), an AR/VR device, and the like. The server may be an independent physical server, a server cluster formed by a plurality of physical servers or a distributed system (such as a distributed cloud storage system), or a cloud server for providing cloud computing and cloud storage services.
In a possible embodiment, as shown in fig. 3 and fig. 4, in the knowledge distillation-based model training method provided in the embodiment of the present application, although the parameters of the teacher model and the student model are different, their data flows are the same; accordingly, both the teacher model and the student model may include a backbone module, a fusion module and a dense prediction module.
Specifically, as shown in fig. 1, the knowledge distillation-based model training method includes steps S101 to S105:
step S101: acquiring a first target feature extracted from an input image based on a teacher model and a second target feature extracted from the input image based on a student model; the first target feature and the second target feature include content features for expressing image content, dense features for predicting image categories.
In knowledge distillation, parameters of a teacher model are fixed, and output results of the teacher model can be used as knowledge to conduct supervision training on a student model so as to optimize network parameters of the student model. It can be understood that the teacher model and the student model are independent models, and relatively speaking, the teacher model has good effect, but the parameter quantity is more and the running speed is slower; the effect of the student model is relatively poor, but the running speed is high due to the small quantity of parameters; according to the embodiment of the application, the student model is supervised and trained through the output result of the teacher model, so that the index effect of the student model is close to that of the teacher model, and finally, the purpose of good effect and high running speed of the student model can be realized.
The input image may be used as a training sample of the student model, and may be data acquired based on the application scene. Accordingly, the teacher model may also be applicable to the same application scenario. Alternatively, a plurality of images may be included in a training sample for training a student model, and the input image referred to in the embodiments of the present application may be one image in the training sample, where each image included in the training sample has a tag for identifying its true category. If the method is applied to the scene of image content identification, each image is provided with a real target frame for identifying the position of a target object to be identified and a label for identifying the real category corresponding to the real target frame.
In the feature acquisition process, the teacher model may extract a first target feature from the input image, the first target feature including a content feature for expressing image content and a dense feature for predicting image categories. Accordingly, the student model may extract second target features from the input image, the second target features including content features for expressing image content and dense features for predicting image categories. Wherein the first target feature is different from the second target feature.
Optionally, the content features include the backbone features extracted from the input image by the backbone module and the fusion features determined by the fusion module based on the backbone features. As shown in fig. 4, for one input image, the teacher model and the student model each extract the corresponding backbone feature (feature_backbone) through the backbone module, and input the backbone feature into the fusion module to obtain the fusion feature (feature_neck) extracted by the fusion module. The network structure and hierarchical relations of the backbone module and the fusion module can be adjusted to suit the application scene, which is not limited by the embodiment of the application.
Optionally, the dense prediction module is connected after the fusion module; after the fusion module obtains the fusion feature, the fusion feature can be input into the dense prediction module to obtain the dense feature (feature_dense_prediction) extracted by the dense prediction module.
In the embodiment of the present application, performing step S101 may obtain the corresponding features shown in the following tables 1 and 2:

Table 1 Features extracted based on the teacher model
  Backbone module: teacher_feature_backbone
  Fusion module: teacher_feature_neck
  Dense prediction module: teacher_feature_dense_prediction

Table 2 Features extracted based on the student model
  Backbone module: student_feature_backbone
  Fusion module: student_feature_neck
  Dense prediction module: student_feature_dense_prediction
Optionally, the predicted image category can be adapted to the application scenario. In a multi-classification scenario, the predicted image category may indicate the category to which the image belongs; for example, in animal image classification it may indicate that the image is a puppy image or a panda image. In a binary classification scenario (quality inspection based on the image), the predicted image category can indicate whether the image passes quality inspection, i.e., whether defects exist (the positions corresponding to the defects can also be indicated), and can be identified by a label 0 or a label 1; on this basis, if multiple defect types exist, labels of the different defect types can be used to identify the detected defect positions respectively.
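For illustration only, a minimal PyTorch sketch of the data flow of step S101 is given below. The detector structure, channel sizes and all names (SimpleDetector, feature_backbone, etc.) are hypothetical stand-ins for the backbone, fusion and dense prediction modules described above, not the patented implementation:

```python
import torch
import torch.nn as nn

class SimpleDetector(nn.Module):
    """Hypothetical model with the backbone -> fusion -> dense-prediction
    data flow shared by the teacher model and the student model."""
    def __init__(self, channels: int):
        super().__init__()
        self.backbone = nn.Conv2d(3, channels, 3, padding=1)       # stands in for a real backbone module
        self.fusion = nn.Conv2d(channels, channels, 3, padding=1)  # stands in for a fusion module
        self.dense_head = nn.Conv2d(channels, channels, 1)         # stands in for a dense prediction module

    def forward(self, image: torch.Tensor):
        feature_backbone = self.backbone(image)        # backbone feature (content feature)
        feature_neck = self.fusion(feature_backbone)   # fusion feature (content feature)
        feature_dense = self.dense_head(feature_neck)  # dense feature (used to predict categories)
        return feature_backbone, feature_neck, feature_dense

# Step S101: the same input image passes through both models.
teacher, student = SimpleDetector(256), SimpleDetector(64)  # teacher is the larger model
image = torch.randn(1, 3, 224, 224)
with torch.no_grad():                                       # teacher parameters stay fixed
    t_backbone, t_neck, t_dense = teacher(image)            # first target feature
s_backbone, s_neck, s_dense = student(image)                # second target feature
```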
Step S102: and determining a characteristic weight matrix corresponding to the content characteristic of the first target characteristic and the dense characteristic of the first target characteristic respectively through an attention mechanism.
The feature weight matrix, computed for the teacher model's output features based on the attention mechanism, is an attention weight matrix. Each element value in the matrix can lie between 0 and 1 and represents the importance of the corresponding element on the feature map of the first target feature. The larger an element value in the attention weight matrix, the more important the influence of the corresponding feature value on the output result. With the attention weight matrix, errors between the result output by the teacher model and the result output by the student model can be given different weights, and feature-element errors given large weights can dominate network optimization.
The teacher model extracts content features (including backbone features and fusion features) and dense prediction features, and a corresponding feature weight matrix can be determined for each kind of feature. That is, performing step S102 may obtain the feature weight matrices shown in table 3 below:

TABLE 3 Correspondence between features extracted by the teacher model and feature weight matrices
  teacher_feature_backbone: teacher_backbone_W
  teacher_feature_neck: teacher_neck_W
  teacher_feature_dense_prediction: teacher_dense_prediction_W
Step S103: the following operations are respectively carried out for the content features and the dense features to determine the respective corresponding feature loss values: and calculating a loss value between the extracted features of the teacher model and the student model, and determining a feature loss value based on the corresponding feature weight matrix and the loss value.
When computing the loss between the teacher model's output features and the student model's output features, the loss value can be calculated in combination with the feature weight matrix, so that the student model imitates the teacher model with emphasis. Considering that different modules in the model output different features, the corresponding feature loss values are calculated separately for the features output by the different modules. Upon execution of step S103, the feature loss values shown in table 4 below can be obtained:

TABLE 4 Correspondence between features and feature loss values
  Backbone features: loss_kd_backbone
  Fusion features: loss_kd_neck
  Dense features: loss_kd_dense_prediction
Step S104: and calculating a class loss value between a predicted class indicated by the dense feature of the second target feature and a real class corresponding to the input image.
The function of the dense prediction module is to detect candidate target frames (proposals), that is, to detect positions in the image where targets to be identified may exist.
The dense feature output by the dense prediction module can be used for predicting two types of information, namely, predicting the position of a required identification object (candidate target frame) and predicting the category corresponding to the candidate target frame.
In the embodiment of the application, when the prediction category output by the student model is optimized, a non-distillation mode can be adopted to calculate the category loss value between the prediction category indicated by the dense characteristic of the second target characteristic and the real category corresponding to the input image.
Step S105: and adjusting network parameters of the student model based on the category loss value and the characteristic loss value.
Optionally, the class loss value and the feature loss value may be combined to adjust the network parameters of the student model, such as taking the sum of the class loss value and the feature loss value as the total loss value.
In the embodiment of the application, the error between the output of the student model and the output of the teacher model is calculated to optimize the parameters of the student network so that the error becomes smaller and smaller. Since both the student model and the teacher model output image features, a feature's dimensions can be N, C, H and W, where N represents the number of features, C the number of channels, H the height and W the width, giving N x C x H x W data values in total. Each data value influences the final result to a different degree. When considering how to select effective data so that the student model imitates the teacher model, the embodiment of the application adopts the Transformer technique to construct an attention weight matrix W, which represents the importance of the N, H and W element values to the final model output. The application calculates the error between the teacher model's output and the student model's output in combination with the attention weight matrix and optimizes the student model with this error, so that the student model imitates the teacher model's output with emphasis and its performance approaches the teacher's. Because the student model has few parameters and thus runs fast, knowledge distillation can yield a student model with both good output quality and high running speed.
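As a minimal sketch of step S105, the following shows how the class loss value and the feature loss values could be summed into a total loss that updates only the student's parameters. The placeholder model and loss tensors are assumptions; the real loss terms are defined in the sections below:

```python
import torch
import torch.nn as nn

student = nn.Linear(8, 8)          # placeholder for the student model
out = student(torch.randn(4, 8))

# Placeholders for the class loss value and the summed feature loss values
# (loss_kd_backbone + loss_kd_neck + loss_kd_dense_prediction).
loss_cls = out.pow(2).mean()
loss_kd = out.abs().mean()
loss_total = loss_cls + loss_kd    # step S105: combine both loss families

optimizer = torch.optim.SGD(student.parameters(), lr=0.01)
optimizer.zero_grad()
loss_total.backward()              # gradients flow only into the student model
optimizer.step()                   # adjust the student's network parameters
```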
The following describes the specific content of the attention weight matrix in the embodiment of the present application.
In a possible embodiment, determining, in step S102, a feature weight matrix corresponding to the content feature of the first target feature and the dense feature of the first target feature, respectively, by using an attention mechanism includes performing the following attention weight calculation operations on the content feature of the first target feature and the dense feature of the first target feature, respectively, in steps A1-A2:
step A1: setting a learnable matrix, and determining a key matrix and a value matrix based on the first target feature extracted by the teacher model.
In the calculation of the attention weight matrix, a learnable matrix Q may first be set, and the first target feature output by the teacher model is assigned to K (key) and V (value) to obtain the key matrix and the value matrix. The attention weight matrix W can then be calculated using the following formula (1):

W = softmax((K^T · Q) / sqrt(d_k) · Mask) · V ... (1)

where d_k denotes the channel dimension of the feature and is a constant, and Mask is the mask matrix.
Step A2: repeatedly execute the assignment operation until a preset stopping condition is met, and determine the finally output learnable matrix as the corresponding feature weight matrix, wherein element values in the feature weight matrix represent the importance of elements on the corresponding feature map.
The assignment operation may be performing attention calculation on the learnable matrix, the key matrix and the value matrix based on a preset calculation rule. The preset stopping condition may be that the assignment operation has been repeated a preset number of times, or another preset stopping condition. Assuming the current preset stopping condition is that the assignment operation is executed N times, the finally output learnable matrix is the result obtained after N iterations.
Optionally, as shown in fig. 5, in an iteration, the attention calculation is performed in step A2 based on the learnable matrix, the key matrix, and the value matrix based on a preset calculation rule, including steps a 21-a 25:
step A21: multiplying the transpose of the key matrix and the learnable matrix to obtain a first matrix.
Step A22: multiplying the first matrix by a scaling factor to obtain a second matrix.
Step A23: multiplying the second matrix by a mask matrix to obtain a third matrix; and each element value in the mask matrix is a preset numerical value.
Step A24: and carrying out probability distribution conversion operation on the third matrix to obtain a fourth matrix indicating the weight.
Step A25: multiplying the fourth matrix by the matrix of values and assigning a multiplication result to the learnable matrix.
Wherein, as shown in fig. 5, the network for calculating the attention weight matrix includes 5 layers: a MatMul layer performing the first matrix multiplication, a Scale layer for scaling, a Mask layer applying the mask tensor, a Softmax layer converting to a probability distribution, and a MatMul layer performing the second matrix multiplication. Based on the attention calculation rule corresponding to this network structure, the specific calculation process is as follows: transpose the key matrix K and multiply the transposed matrix with the learnable matrix Q to obtain the first matrix; multiply the first matrix by the scaling coefficient to obtain the second matrix; multiply the second matrix by a mask matrix whose element values are all 1 to obtain the third matrix; perform a softmax operation to convert the third matrix into a probability distribution, obtaining the weight matrix (fourth matrix); and multiply the fourth matrix by the value matrix to obtain the learnable matrix for the current iteration.
In the embodiment of the application, the attention weight matrix calculation is executed separately for the backbone features, the fusion features and the dense features extracted by the teacher model, obtaining three feature weight matrices corresponding to the output features of the different modules.
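A hedged sketch of steps A1-A2 and A21-A25 follows. The patent only fixes the layer order (MatMul, Scale, Mask, Softmax, MatMul); the token layout, the orientation of the matrix products and the 1/sqrt(d_k) scaling below are assumptions chosen so that the shapes are consistent and the code runs:

```python
import torch

def attention_weight_matrix(teacher_feature: torch.Tensor, num_iters: int = 3) -> torch.Tensor:
    # Flatten the first target feature (N, C, H, W) into (N*H*W, C) tokens.
    n, c, h, w = teacher_feature.shape
    tokens = teacher_feature.permute(0, 2, 3, 1).reshape(n * h * w, c)
    K = V = tokens                             # key and value matrices from the teacher feature
    Q = torch.randn(n * h * w, c)              # the learnable matrix
    scale = c ** -0.5                          # scaling coefficient, assumed to be 1/sqrt(d_k)
    mask = torch.ones(n * h * w, n * h * w)    # mask matrix whose elements are a preset value (1)

    for _ in range(num_iters):                 # preset stop condition: fixed iteration count
        first = Q @ K.t()                      # MatMul with the transposed key matrix
        second = first * scale                 # Scale
        third = second * mask                  # Mask (element-wise)
        fourth = torch.softmax(third, dim=-1)  # probability-distribution conversion
        Q = fourth @ V                         # MatMul; result assigned to the learnable matrix

    # Reshape back so element values indicate importance on the feature map.
    return Q.reshape(n, h, w, c).permute(0, 3, 1, 2)

teacher_backbone_W = attention_weight_matrix(torch.randn(1, 8, 4, 4))
```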
The following describes specific details of determining the feature loss value in the embodiment of the present application.
In a possible embodiment, the following operations are performed in step S103 for the content feature and the dense feature, respectively, to determine the respective corresponding feature loss values: calculating a loss value between the extracted features of the teacher model and the student model, and determining a feature loss value based on the corresponding feature weight matrix and the loss value, wherein the method comprises the steps of B1-B2:
step B1: and carrying out convolution processing on the content features of the second target features so as to enable the dimensionality of the content features of the first target features to be consistent with that of the content features of the second target features, calculating a square difference loss value between the content features of the first target features and the convolved content features of the second target features element by element, and multiplying the loss value by a feature weight matrix corresponding to the content features to obtain a first feature loss value.
Step B2: and carrying out convolution processing on the dense features of the second target features so as to enable the dimensions of the dense features of the first target features and the dense features of the second target features to be consistent, calculating a square difference loss value between the dense features of the first target features and the dense features of the convolved second target features element by element, and multiplying the loss value by a feature weight matrix corresponding to the dense features to obtain a second feature loss value.
The backbone features extracted by the student model are illustrated below:
The acquired backbone feature student_feature_backbone output by the student model is convolved through one layer of convolutional neural network, so that its dimensions are consistent with those of the backbone feature teacher_feature_backbone output by the teacher model; both have N x C x H x W elements, where N represents the number of features, C the number of feature channels, H the feature height and W the feature width. First, the squared-difference loss loss_feature_backbone between the backbone feature teacher_feature_backbone output by the teacher model and the convolved backbone feature student_feature_backbone output by the student model can be calculated element by element, as shown in the following formula (2):
loss_feature_backbone = (y_t − y_s)^2 ... (2)

where y_t represents the true value (the backbone feature output by the teacher model) and y_s represents the predicted value (the backbone feature output by the convolved student model); the smaller the squared-difference loss, the smaller the gap between the student model and the teacher model, and the better the performance of the student model.
The feature weight matrix teacher_backbone_W of the backbone features is obtained and multiplied with the squared-difference loss to obtain the knowledge-distillation feature loss value loss_kd_backbone, as shown in the following formula (3):

loss_kd_backbone = teacher_backbone_W * loss_feature_backbone ... (3)
Through the processing of formula (3), different element values on the features can be given different degrees of importance. When optimizing the knowledge distillation loss, the elements that influence the final prediction result are optimized with emphasis according to their weights.
Based on the above example for backbone features, the following two feature loss values can be obtained in the same way:
(1) The weighted loss between the fusion feature teacher_feature_neck output by the teacher model, the fusion feature student_feature_neck output by the student model and the feature weight matrix teacher_neck_W of the fusion features is recorded as loss_kd_neck.
(2) The weighted loss between the dense feature teacher_feature_dense_prediction output by the teacher model, the dense feature student_feature_dense_prediction output by the student model and the feature weight matrix teacher_dense_prediction_W of the dense features is recorded as loss_kd_dense_prediction.
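A short sketch of steps B1-B2 under stated assumptions: the 1x1 alignment convolution, the channel sizes and the final mean reduction are illustrative choices, since the patent only specifies convolving, taking the element-wise squared difference and multiplying by the feature weight matrix:

```python
import torch
import torch.nn as nn

def feature_kd_loss(teacher_feat, student_feat, weight_matrix, align_conv):
    aligned = align_conv(student_feat)      # make the student dims match the teacher's
    loss = (teacher_feat - aligned).pow(2)  # formula (2), element by element
    return (weight_matrix * loss).mean()    # formula (3), weighted; mean reduction assumed

# Illustrative shapes: teacher channels 256, student channels 64.
align = nn.Conv2d(64, 256, kernel_size=1)
t_feat = torch.randn(1, 256, 16, 16)
s_feat = torch.randn(1, 64, 16, 16)
teacher_backbone_W = torch.rand(1, 256, 16, 16)  # attention weights in [0, 1]
loss_kd_backbone = feature_kd_loss(t_feat, s_feat, teacher_backbone_W, align)
# loss_kd_neck and loss_kd_dense_prediction are obtained in the same way.
```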
The following describes the specific content of calculating the class loss value in the embodiment of the present application.
In the embodiment of the application, in order to better train the student model, the features used for knowledge distillation are further extended. As shown in figs. 3 and 4, a sparse prediction module is also included. This module can further process the candidate target frames detected by the dense prediction module, removing incorrect candidate target frames and fine-tuning the correct ones, so that the adjusted candidate target frames are closer to the real target frame gt_box indicating the object to be identified in the image. On this basis, sparse features can be derived from dense features. In the network framework, the sparse prediction module is connected to the dense prediction module and can output sparse features based on the input dense features, so that dense features and sparse features are jointly considered in the calculation of the class loss value, improving the accuracy of candidate target frame detection and class prediction.
In a possible embodiment, calculating a class loss value between the predicted class indicated by the class feature of the second target feature and the real class corresponding to the input image in step S104 includes: acquiring the sparse feature of the second target feature determined based on the dense feature of the second target feature; and calculating class loss values between the real class corresponding to the input image and, respectively, the predicted class indicated by the dense features and the predicted class indicated by the sparse features of the second target feature.
When the classification prediction result of the student model is optimized, the dense features are processed, sparse features can be obtained based on the dense features, and then errors between the prediction category of the student model and the real category corresponding to the input image are considered by combining the sparse features.
Optionally, determining sparse features of the second target feature based on dense features of the second target feature in the above step includes steps C1-C2:
step C1: predicting a candidate target frame in the input image based on dense features of the second target feature; the candidate target box indicates a location in the input image where a target object is present that is predicted based on dense features of the second target feature.
Step C2: and cutting the dense features of the second target features based on the coordinate information of the candidate target frame in the input image to obtain sparse features of the second target features.
Specifically, candidate target frames student_proposal (whose number is far smaller than the number of prior target frames) can be predicted from the dense feature student_feature_dense_prediction output by the student model; the coordinate information of the predicted candidate target frames in the input image is then used to crop the dense feature student_feature_dense_prediction output by the student model, obtaining the sparse feature student_feature_sparse_prediction of the second target feature.
Optionally, the sparse feature of the first target feature is determined with the same logic as steps C1-C2: candidate target frames teacher_proposal can be predicted from the dense feature teacher_feature_dense_prediction output by the teacher model, and the coordinate information of the predicted candidate target frames in the input image is used to crop that dense feature, obtaining the sparse feature teacher_feature_sparse_prediction of the first target feature.
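As one possible rendering of steps C1-C2, the cropping can be done with torchvision's roi_align operator; the patent does not name a specific cropping operation, so this operator, the box coordinates and the 7x7 output size are assumptions:

```python
import torch
from torchvision.ops import roi_align

student_feature_dense_prediction = torch.randn(1, 64, 32, 32)
# One hypothetical candidate target frame: (batch index, x1, y1, x2, y2).
student_proposal = torch.tensor([[0.0, 4.0, 4.0, 20.0, 20.0]])
# Crop the dense feature with the proposal coordinates -> sparse feature.
student_feature_sparse_prediction = roi_align(
    student_feature_dense_prediction, student_proposal,
    output_size=(7, 7), spatial_scale=1.0)  # one 7x7 sparse feature per proposal
```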
The following describes specific details of determining a class loss value in an embodiment of the present application.
In a possible embodiment, the step S104 of calculating a first class loss value between the predicted class indicated by the dense feature of the second target feature and the real class corresponding to the input image includes steps D1-D3:
step D1: and calculating a first dense loss value based on the dense features of the second target features and the error between the real target frame corresponding to the input image and the prior target frame.
The real target frame of the input image can be marked manually, and the coordinate information of the real target frame in the input image is represented by coordinates with four values, such as a real target frame gt_box (x 1, y1, w1, h 1), wherein x and y represent the central coordinates of the target frame, and w and h represent the width and the height of the target frame.
The prior target frames of the input image may include a plurality of prior rectangular frames (x, y, w, h) of different sizes and aspect ratios that are set empirically, where x, y represent the center-point coordinates of a rectangular frame and w, h represent its width and height respectively. The prior target frames prior_box (x0, y0, w0, h0) are very numerous and can be densely tiled over the input picture so as to cover any position in the picture, i.e., to cover all targets in the picture.
The error gt_delta between the prior target frame prior_box and the artificially marked real target frame gt_box can be used as a target of network prediction. The first dense loss value may thus be calculated in combination with a prediction error prediction_delta between the prior target box and the real target box predicted based on the dense feature of the second target feature, a real error (target error) gt_delta between the prior target box priority_box and the artificially noted real target box gt_box.
Optionally, in step D1, calculating a first dense loss value based on the dense features of the second target feature and the error between the real target frame corresponding to the input image and the prior target frame includes steps D11-D13:

Step D11: acquiring the target error between the real target frame and the prior target frame corresponding to the input image.

The target error gt_delta can be determined from the real target frame gt_box and the prior target frame prior_box corresponding to the input image; the target error is the real error.

Step D12: predicting a first prediction error between the real target frame and the prior target frame based on the dense features of the second target feature.

The dense features of the second target feature can be used to predict the error between the real target frame gt_box corresponding to the input image and the prior target frame prior_box, yielding the prediction error prediction_delta.
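As an illustration, the following sketch computes such a delta, assuming the common center-offset and log-size parameterization of anchor-based detectors; the embodiment does not fix the exact form of gt_delta, so this encoding is an assumption.

```python
import math

def encode_delta(prior_box, gt_box):
    """prior_box, gt_box: (x, y, w, h) center-size tuples in image coordinates."""
    x0, y0, w0, h0 = prior_box
    x1, y1, w1, h1 = gt_box
    return ((x1 - x0) / w0,      # horizontal center offset, normalized by prior width
            (y1 - y0) / h0,      # vertical center offset, normalized by prior height
            math.log(w1 / w0),   # log width ratio
            math.log(h1 / h0))   # log height ratio
```

The same function would compute gt_delta from (prior_box, gt_box), while the network regresses prediction_delta directly.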
Step D13: calculating a first dense loss value between the target error and the first prediction error based on a preset intersection-over-union (IoU) loss function.

The IoU loss function can be expressed as formula (4) below:

iou_loss = 1 - (gt_delta ∩ prediction_delta) / (gt_delta ∪ prediction_delta)  (4)

where the numerator is the intersection of the target error and the prediction error, and the denominator is the union of the target error and the prediction error.

Optimizing the network parameters of the student model is the process of continuously reducing the iou_loss value, i.e. making the prediction error prediction_delta approach the real error gt_delta more and more closely; finally, the prediction error prediction_delta can be added onto the prior target frame prior_box to obtain the final estimate of the target position.
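A minimal IoU-loss sketch follows, assuming PyTorch and corner-format boxes; applying the loss to boxes decoded from the prior frame plus the predicted delta is an implementation assumption, since the embodiment states the loss in terms of the errors themselves.

```python
import torch

def iou_loss(pred_box: torch.Tensor, gt_box: torch.Tensor) -> torch.Tensor:
    """pred_box, gt_box: [N, 4] tensors of (x1, y1, x2, y2). Returns mean 1 - IoU."""
    lt = torch.max(pred_box[:, :2], gt_box[:, :2])   # top-left of intersection
    rb = torch.min(pred_box[:, 2:], gt_box[:, 2:])   # bottom-right of intersection
    wh = (rb - lt).clamp(min=0)                      # empty overlaps clip to zero
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred_box[:, 2] - pred_box[:, 0]) * (pred_box[:, 3] - pred_box[:, 1])
    area_g = (gt_box[:, 2] - gt_box[:, 0]) * (gt_box[:, 3] - gt_box[:, 1])
    union = area_p + area_g - inter
    iou = inter / union.clamp(min=1e-6)              # avoid division by zero
    return (1.0 - iou).mean()
```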
Step D2: predicting the prediction category to which a candidate target frame in the input image belongs based on the dense features of the second target feature, and calculating a second dense loss value between the prediction category and the real category to which the real target frame corresponding to the input image belongs; a candidate target frame indicates a location in the input image, predicted from the dense features of the second target feature, where a target object is present.

The dense features of the second target feature can also be used to predict which class the corresponding candidate target frame belongs to, and the cross entropy loss function CrossEntropyLoss can be used to calculate the error between the predicted class and the real class.
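As a sketch of this classification error, assuming PyTorch, where the per-frame class logits come from the student's dense prediction head and gt_classes holds the labels of the matched real target frames:

```python
import torch
import torch.nn.functional as F

def class_loss(logits: torch.Tensor, gt_classes: torch.Tensor) -> torch.Tensor:
    """logits: [N, num_classes] predicted per candidate frame;
    gt_classes: [N] integer labels of the matched real target frames."""
    return F.cross_entropy(logits, gt_classes)
```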
Step D3: a first class loss value between a predicted class indicated by the dense feature of the second target feature and a true class corresponding to the input image is determined based on the first dense loss value and the second dense loss value.
Optionally, the first class loss value loss_dense_prediction between the predicted class indicated by the dense features of the second target feature and the real class corresponding to the input image may be taken as the sum of the first dense loss value and the second dense loss value.
In a possible embodiment, the calculating in step S104 a second class loss value between the predicted class indicated by the sparse feature of the second target feature and the real class corresponding to the input image includes steps E1-E3:
Step E1: calculating a first sparse loss value based on the sparse features of the second target feature and the error between the real target frame and the prior target frame corresponding to the input image.
The description of the real target frame and the prior target frame may refer to the related example of the step D1, which is not described herein.
Optionally, in step E1, calculating a first sparse loss value based on the sparse features of the second target feature and the error between the real target frame and the prior target frame corresponding to the input image includes steps E11-E12:

Step E11: predicting a second prediction error between the real target frame corresponding to the input image and the candidate target frame based on the sparse features of the second target feature.

Step E12: calculating a first sparse loss value between the target error and the second prediction error based on the preset intersection-over-union (IoU) loss function; the target error is the error between the real target frame and the prior target frame corresponding to the input image.

Once the sparse features of the second target feature are obtained, they can be used to predict the coordinate error between the candidate target frame student_proposal and the manually annotated real target frame gt_box, and the deviation between the prediction result and the real result is then calculated with the IoU loss.
Step E2: predicting a prediction category to which a candidate target frame in the input image belongs based on the sparse feature of the second target feature, and calculating a second sparse loss value between the prediction category and a real category to which a real target frame corresponding to the input image belongs; the candidate target box indicates a location in the input image where a target object is present that is predicted based on sparse features of the second target feature.
Optionally, the cross entropy loss function CrossEntropyLoss may be used to calculate the error between the class predicted for the candidate target frame from the sparse features of the second target feature and the real class to which the real target frame belongs.

Step E3: determining a second class loss value between the predicted class indicated by the sparse features of the second target feature and the real class corresponding to the input image based on the first sparse loss value and the second sparse loss value.

Optionally, the second class loss value loss_sparse_prediction between the predicted class indicated by the sparse features of the second target feature and the real class corresponding to the input image may be taken as the sum of the first sparse loss value and the second sparse loss value.

In the embodiment of the application, knowledge distillation can be divided, according to the distillation object, into feature-based knowledge distillation and logic-based knowledge distillation: feature-based distillation means that the output features of the student network imitate the output features of the teacher network, while logic-based distillation means that the predicted class distribution of the student network imitates that of the teacher network. The following describes how logic distillation is used in combination with the feature-based knowledge distillation of the embodiments of the present application.
In a possible embodiment, the knowledge distillation-based model training method provided by the embodiment of the application further includes step F1-step F2:
step F1: sparse features of the first target feature determined based on dense features of the first target feature are acquired.
Alternatively, the operation of determining the sparse feature of the first target feature based on the dense feature of the first target feature may refer to a specific example of acquiring the sparse feature in the foregoing embodiment, which is not described herein.
Step F2: performing logic distillation based on the sparse features of the first target feature and the sparse features of the second target feature to obtain a logic loss value.

Optionally, a DKD (Decoupled Knowledge Distillation) algorithm may be used to logically distill the classification prediction result of the sparse prediction module to obtain a logic loss value loss_kd_sparse_prediction.
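For illustration, the following is a hedged sketch of the DKD computation over the sparse classification logits, assuming PyTorch; the temperature T and the weights alpha and beta are assumed hyperparameters, and the masking construction of the non-target distribution follows the published DKD formulation rather than anything specified in this embodiment.

```python
import torch
import torch.nn.functional as F

def dkd_loss(logits_s, logits_t, target, alpha=1.0, beta=8.0, T=4.0):
    """logits_s / logits_t: [N, C] student / teacher sparse-head logits;
    target: [N] ground-truth class indices."""
    mask = F.one_hot(target, logits_s.size(1)).bool()

    p_s = F.softmax(logits_s / T, dim=1)
    p_t = F.softmax(logits_t / T, dim=1)

    # Target-class KD: KL between binary (target vs non-target) distributions.
    b_s = torch.stack([p_s[mask], 1.0 - p_s[mask]], dim=1)
    b_t = torch.stack([p_t[mask], 1.0 - p_t[mask]], dim=1)
    tckd = F.kl_div(b_s.clamp(min=1e-8).log(), b_t, reduction="batchmean") * T ** 2

    # Non-target-class KD: KL over the remaining classes, target masked out.
    nt_s = F.log_softmax(logits_s / T - 1000.0 * mask, dim=1)
    nt_t = F.softmax(logits_t / T - 1000.0 * mask, dim=1)
    nckd = F.kl_div(nt_s, nt_t, reduction="batchmean") * T ** 2

    return alpha * tckd + beta * nckd
```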
Accordingly, adjusting the network parameters of the student model based on the category loss value and the feature loss value in step S105 includes: and adjusting network parameters of the student model based on the sum of the category loss value, the feature loss value and the logic loss value.
Optionally, combining the loss values calculated in the above embodiments, the total loss value shown in formula (5) can be obtained:

Loss = loss_kd_backbone + loss_kd_neck + loss_dense_prediction + loss_kd_dense_prediction + loss_sparse_prediction + loss_kd_sparse_prediction  (5)
Alternatively, the network parameters of the student model may be optimized using a gradient descent method based on the total loss value.
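For illustration, a minimal sketch of one such optimization step, assuming PyTorch and assuming the six loss terms of formula (5) are already available as scalar tensors:

```python
import torch

def train_step(optimizer: torch.optim.Optimizer,
               losses: list[torch.Tensor]) -> float:
    """One gradient-descent step over the summed loss terms of formula (5)."""
    total_loss = torch.stack(losses).sum()
    optimizer.zero_grad()
    total_loss.backward()   # gradients flow into the student only; the teacher is frozen
    optimizer.step()
    return total_loss.item()
```

In practice the optimizer would be constructed over the student model's parameters only, e.g. torch.optim.SGD(student_model.parameters(), lr=...), since the teacher model is not updated.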
In the embodiment of the application, as shown in fig. 6, the feature-based knowledge distillation scheme and the logic-based knowledge distillation scheme are combined to make knowledge distillation more comprehensive, and the performance of the student model obtained by training can be effectively improved.
The method provided by the embodiment of the application performs knowledge distillation on the features of the network model, has strong generality, and can be transplanted without modification to various tasks such as target detection, instance segmentation, semantic segmentation and target classification. In addition, the trained student model effectively reduces the required computing power and saves cost: through knowledge distillation, the running speed of the student model can reach twice that of the teacher model, saving about half of the GPU resources. When deployed, the deep learning model can run on low-power equipment; for example, the trained student model can be deployed on terminal devices such as mobile phones and vehicle-mounted devices.

The embodiment of the application considers device quality inspection scenarios, such as notebook computer housings, tablet computer housings and electronic watch housings: defects such as bruises, crush marks and scratches on the housing are unavoidable in industrial production, and whether a housing is defective is generally identified by manually inspecting the housing or by machine quality inspection.
For quality inspection of industrial defects, the miss rate and the over-kill rate of the model are generally of concern. Miss rate: given N input products of which i are defective, if the model detects only j of them (j < i), the number of missed defects is (i - j) and the miss rate is (i - j)/N × 100%. Over-kill rate: given N input products, if k non-defective products are mistakenly judged as defective, the over-kill rate is k/N × 100%.
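The two rates reduce to simple ratios; a small sketch (function names are illustrative):

```python
def miss_rate(n_total: int, n_defective: int, n_detected: int) -> float:
    """Percentage of all inspected products whose defects the model missed."""
    return (n_defective - n_detected) / n_total * 100.0

def overkill_rate(n_total: int, n_false_defective: int) -> float:
    """Percentage of all inspected products wrongly flagged as defective."""
    return n_false_defective / n_total * 100.0
```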
Because device images have high resolution, each device corresponds to a large number of pictures, and identifying whether a device is defective takes a long time. Reducing this time while guaranteeing the quality inspection effect, saving cost and lowering GPU consumption is therefore an urgent problem to be solved.

The embodiment of the application therefore aims to find defects on products and keep the miss rate and the over-kill rate within standard while increasing the speed of defect quality inspection.
Based on this, the embodiment of the present application further provides a quality inspection method, as shown in fig. 2, where the method specifically includes step S201: and identifying the equipment image through a pre-trained quality inspection model, and determining whether defects exist in the equipment image and the defect category corresponding to the defects.
The quality inspection model is a student model trained by a model training method based on knowledge distillation. Through knowledge distillation, a student model with good quality inspection effect and high running speed can be obtained and used as a quality inspection model.
Optionally, when detecting defects of an electronic device housing, two imaging devices may be used: a smart camera for photographing the sides (as shown in fig. 7) and a line scanner for photographing the front (as shown in fig. 8), the line scanner being matched with three light sources, namely bright field, dark field, and combined bright-and-dark field. Accordingly, the quality inspection model may include a first quality inspection model for detecting side defects of the device housing and a second quality inspection model for detecting defects on the large face of the device housing; the device image includes a first image taken of a side of the device housing and a second image taken of the large face of the device housing.
Optionally, the defect class is related to a class of the device and includes a plurality of substantial defect classes and a plurality of non-substantial defect classes. Examples of the substantial defect types include scratch (HS), bruise (PS), crush (YS), moire (DW), bright mark (LH), polished mark (PH), heterochromatic (YIS), film (GM), black line (HX), bright spot (LD), white spot (BD), oxidation (YH), and the like, and examples of the non-substantial defect types include dirt (ZW), foreign matter (YW), texture (WL), and the like.
It should be noted that, when the above embodiments of the present application are applied to specific products or technologies, the collection, use and processing of related data (such as content features, dense features, input images, device images, etc.) require permission or consent and must comply with the relevant laws, regulations and standards of the relevant countries and regions. That is, in the embodiments of the present application, if data related to a subject is involved, the data must be obtained with the subject's authorized consent and in accordance with the relevant laws, regulations and standards.
An embodiment of the present application provides a knowledge distillation based model training apparatus, as shown in fig. 9, the knowledge distillation based model training apparatus 100 may include: a feature acquisition module 101, a weight determination module 102, a feature loss determination module 103, a category loss determination module 104, and a parameter adjustment module 105.
Wherein, the feature acquisition module 101 is used for acquiring a first target feature extracted from an input image based on a teacher model and a second target feature extracted from the input image based on a student model; the first target feature and the second target feature comprise content features for expressing content, dense features for predicting categories; the weight determining module 102 is configured to determine, by using an attention mechanism, a feature weight matrix corresponding to a content feature of the first target feature and a dense feature of the first target feature respectively; a feature loss determination module 103, configured to determine respective corresponding feature loss values for the content feature and the dense feature by performing the following operations: calculating a loss value between the extracted features of the teacher model and the student model, and determining a feature loss value based on the corresponding feature weight matrix and the loss value; a category loss determination module 104, configured to calculate a category loss value between a predicted category indicated by the dense feature of the second target feature and a real category corresponding to the input image; a parameter adjustment module 105, configured to adjust a network parameter of the student model based on the category loss value and the feature loss value.
An embodiment of the present application provides a quality inspection device. As shown in fig. 10, the quality inspection device 200 may include: an image recognition module 201.
The image recognition module 201 is configured to recognize an equipment image through a pre-trained quality inspection model, and determine whether a defect exists in the equipment image and a defect class corresponding to the defect;
the quality inspection model is a student model obtained by training through the knowledge distillation-based model training method.
The device of the embodiment of the present application may perform the method provided by the embodiment of the present application, and its implementation principle is similar, and actions performed by each module in the device of the embodiment of the present application correspond to steps in the method of the embodiment of the present application, and detailed functional descriptions of each module of the device may be referred to the descriptions in the corresponding methods shown in the foregoing, which are not repeated herein.
The modules involved in the embodiments of the present application may be implemented in software. In some cases, the name of a module does not limit the module itself; for example, the feature acquisition module may also be described as "a module that acquires a first target feature extracted from an input image based on a teacher model and a second target feature extracted from the input image based on a student model", "a first module", or the like.
The embodiment of the application provides an electronic device comprising a memory, a processor and a computer program stored on the memory, wherein the processor executes the computer program to implement the steps of the knowledge distillation-based model training method. Compared with the related art, the following can be achieved:

In a first aspect, an embodiment of the present application provides a knowledge distillation-based model training method. Specifically, for an input image, a first target feature extracted by a teacher model and a second target feature extracted by a student model are first obtained; since the teacher model and the student model differ in network parameters but share the same data flow, the first target feature and the second target feature each include a content feature for expressing content and a dense feature for predicting a category. On this basis, the embodiment of the application determines, through an attention mechanism, the feature weight matrices corresponding to the content features and the dense features of the first target feature, i.e., attention weight matrices are calculated for the features output by the teacher model. Since this calculation does not rely on manual design, the cost of obtaining the weight matrices is effectively reduced, and processing through the attention mechanism ensures the validity of the resulting weight matrices, so that optimal weight matrices can be obtained. Then, feature loss values between the teacher model features and the student model features can be calculated for the content features and the dense features respectively based on the corresponding feature weight matrices, so that different errors between teacher and student features are given different weights; errors given large weights dominate the network optimization, and the attention weight matrices thus enable the student model to imitate the output of the teacher model better and more effectively, improving the effect of the student model. In addition, the embodiment of the application also calculates a class loss value between the class predicted from the dense features of the student model and the real class corresponding to the input image, so that both the feature loss value and the class loss value are considered when adjusting the network parameters of the student model, making the effect of the student model closer to that of the teacher model and improving the performance of the trained student model.

In a second aspect, an embodiment of the present application provides a quality inspection method. Specifically, an equipment image is identified through a pre-trained quality inspection model, and the defects present in the equipment image and the defect classes corresponding to the defects are predicted; the quality inspection model can be a student model trained by the model training method of the first aspect, so that, on the basis of the improved performance of the trained student model, the accuracy of quality inspection can be improved while the speed of defect quality inspection is increased.
In an alternative embodiment, an electronic device is provided. As shown in fig. 11, the electronic device 4000 includes: a processor 4001 and a memory 4003, the processor 4001 being coupled to the memory 4003, for example via a bus 4002. Optionally, the electronic device 4000 may further include a transceiver 4004, which may be used for data interaction between this electronic device and other electronic devices, such as sending and/or receiving data. It should be noted that, in practical applications, the number of transceivers 4004 is not limited to one, and the structure of the electronic device 4000 does not limit the embodiments of the present application.
The processor 4001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various exemplary logic blocks, modules and circuits described in connection with this disclosure. The processor 4001 may also be a combination that implements computing functionality, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.

Bus 4002 may include a path for transferring information between the above components. Bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like, and can be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 11, but this does not mean that there is only one bus or one type of bus.

Memory 4003 may be, but is not limited to, a ROM (Read Only Memory) or other type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical disk storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store a computer program and that can be read by a computer.
The memory 4003 is used for storing a computer program for executing an embodiment of the present application, and is controlled to be executed by the processor 4001. The processor 4001 is configured to execute a computer program stored in the memory 4003 to realize the steps shown in the foregoing method embodiment.
Among them, electronic devices include, but are not limited to: terminal and server.
Embodiments of the present application provide a computer readable storage medium having a computer program stored thereon, which when executed by a processor, implements the steps of the foregoing method embodiments and corresponding content.
The embodiment of the application also provides a computer program product, which comprises a computer program, wherein the computer program can realize the steps and corresponding contents of the embodiment of the method when being executed by a processor.
The terms "first," "second," "third," "fourth," "1," "2," and the like in the description and in the claims and in the above figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate, such that the embodiments of the application described herein may be implemented in other sequences than those illustrated or otherwise described.
It should be understood that, although various operation steps are indicated by arrows in the flowcharts of the embodiments of the present application, the order in which these steps are implemented is not limited to the order indicated by the arrows. In some implementations of embodiments of the application, the implementation steps in the flowcharts may be performed in other orders as desired, unless explicitly stated herein. Furthermore, some or all of the steps in the flowcharts may include multiple sub-steps or multiple stages based on the actual implementation scenario. Some or all of these sub-steps or phases may be performed at the same time, or each of these sub-steps or phases may be performed at different times, respectively. In the case of different execution time, the execution sequence of the sub-steps or stages can be flexibly configured according to the requirement, which is not limited by the embodiment of the present application.
The foregoing is merely an optional implementation of some of the application scenarios of the present application. It should be noted that, for those skilled in the art, other similar implementations adopted on the basis of the technical ideas of the present application, without departing from those ideas, also fall within the protection scope of the embodiments of the present application.

Claims (16)

1. A knowledge distillation-based model training method, comprising:
acquiring a first target feature extracted from an input image based on a teacher model and a second target feature extracted from the input image based on a student model; the first target feature and the second target feature comprise content features for expressing image content and dense features for predicting image categories;
determining a characteristic weight matrix corresponding to the content characteristics of the first target characteristics and the dense characteristics of the first target characteristics respectively through an attention mechanism;
the following operations are respectively carried out for the content features and the dense features to determine the respective corresponding feature loss values: calculating a loss value between the extracted features of the teacher model and the student model, and determining a feature loss value based on the corresponding feature weight matrix and the loss value, including: performing convolution processing on the content features of the second target features to enable the dimensionality of the content features of the first target features to be consistent with that of the content features of the second target features, calculating loss values between the content features of the first target features and the content features of the convolved second target features element by element, and multiplying the loss values by a feature weight matrix corresponding to the content features to obtain first feature loss values; performing convolution processing on the dense features of the second target features to enable the dimensions of the dense features of the first target features and the dense features of the second target features to be consistent, calculating loss values between the dense features of the first target features and the dense features of the convolved second target features element by element, and multiplying the loss values by a feature weight matrix corresponding to the dense features to obtain second feature loss values;
Calculating a class loss value between a predicted class indicated by the dense feature of the second target feature and a real class corresponding to the input image;
network parameters of the student model are adjusted based on the category loss value, the first feature loss value, and the second feature loss value.
2. The method according to claim 1, wherein determining, by an attention mechanism, feature weight matrices respectively corresponding to content features of the first target feature and dense features of the first target feature comprises performing the following attention weight calculation operations for the content features of the first target feature and the dense features of the first target feature, respectively:
setting a learnable matrix, and determining a key matrix and a value matrix based on a first target feature extracted by the teacher model;
repeatedly executing an assignment operation until a preset stopping condition is met, and determining the finally output learnable matrix as the corresponding feature weight matrix, wherein element values in the feature weight matrix represent the importance degrees of elements on the feature map corresponding to the first target feature;
wherein the assigning operation includes: and performing attention calculation based on a preset calculation rule based on the learnable matrix, the key matrix and the value matrix.
3. The method of claim 2, wherein the performing attention calculations based on the learnable matrix, the key matrix, and the value matrix based on preset calculation rules comprises:
multiplying the transpose of the key matrix by the learnable matrix to obtain a first matrix;
multiplying the first matrix by a scaling factor to obtain a second matrix;
multiplying the second matrix by a mask matrix to obtain a third matrix; each element value in the mask matrix is a preset value;
performing probability distribution conversion operation on the third matrix to obtain a fourth matrix indicating the weight;
multiplying the fourth matrix by the matrix of values and assigning a multiplication result to the learnable matrix.
4. The method of claim 1, wherein the calculating a class loss value between a predicted class indicated by the dense feature of the second target feature and a true class corresponding to the input image comprises:
acquiring sparse features of the second target feature determined based on dense features of the second target feature;
and calculating, for the predicted class indicated by the dense features and for the predicted class indicated by the sparse features of the second target feature respectively, class loss values against the real class corresponding to the input image.
5. The method of claim 4, wherein the determining sparse features of the second target feature based on dense features of the second target feature comprises:
predicting a candidate target frame in the input image based on dense features of the second target feature; the candidate target box indicates a location in the input image where a target object is present, predicted based on dense features of the second target feature;
and cutting the dense features of the second target features based on the coordinate information of the candidate target frame in the input image to obtain sparse features of the second target features.
6. The method of claim 4, wherein calculating a first class loss value between a predicted class indicated by the dense feature of the second target feature and a true class corresponding to the input image comprises:
calculating a first dense loss value based on the dense features of the second target features and the error between the real target frame corresponding to the input image and the prior target frame;
predicting a prediction category to which a candidate target frame in the input image belongs based on the dense features of the second target features, and calculating a second dense loss value between the prediction category and a real category to which a real target frame corresponding to the input image belongs; the candidate target box indicates a location in the input image where a target object is present, predicted based on dense features of the second target feature;
A first class loss value between a predicted class indicated by the dense feature of the second target feature and a true class corresponding to the input image is determined based on the first dense loss value and the second dense loss value.
7. The method of claim 6, wherein the calculating a first dense loss value based on the dense feature of the second target feature, the error between the real target frame and the prior target frame corresponding to the input image, comprises:
acquiring a target error between a real target frame and a priori target frame corresponding to the input image;
predicting a first prediction error between the real target frame and an a priori target frame based on dense features of the second target features;
and calculating a first dense loss value between the target error and the first prediction error based on a preset intersection-over-union (IoU) loss function.
8. The method according to any of claims 4-7, wherein calculating a second class loss value between a predicted class indicated by sparse features of the second target feature and a true class corresponding to the input image comprises:
calculating a first sparse loss value based on the sparse features of the second target features and errors between a real target frame and a priori target frame corresponding to the input image;
Predicting a prediction category to which a candidate target frame in the input image belongs based on the sparse feature of the second target feature, and calculating a second sparse loss value between the prediction category and a real category to which a real target frame corresponding to the input image belongs; the candidate target box indicates a location in the input image where a target object is present, predicted based on sparse features of the second target feature;
and determining a second class loss value between a predicted class indicated by the sparse feature of the second target feature and a real class corresponding to the input image based on the first sparse loss value and the second sparse loss value.
9. The method of claim 8, wherein the calculating a first sparse loss value based on the sparse feature of the second target feature, the error between a real target box and an a priori target box corresponding to the input image, comprises:

predicting a second prediction error between a real target frame corresponding to the input image and the candidate target frame based on the sparse feature of the second target feature;

calculating a first sparse loss value between a target error and the second prediction error based on a preset intersection-over-union (IoU) loss function; the target error is the error between the real target frame and the prior target frame corresponding to the input image.
10. The method as recited in claim 1, further comprising:
acquiring sparse features of the first target feature determined based on dense features of the first target feature;
performing logic distillation based on the sparse features of the first target features and the sparse features of the second target features to obtain logic loss values;
the adjusting the network parameters of the student model based on the category loss value and the feature loss value includes: and adjusting network parameters of the student model based on the sum of the category loss value, the feature loss value and the logic loss value.
11. A method of quality testing comprising:
identifying equipment images through a pre-trained quality inspection model, and determining whether defects exist in the equipment images and the defect types corresponding to the defects;
wherein the quality inspection model is a student model trained by the knowledge distillation-based model training method of any one of claims 1-10.
12. The method of claim 11, wherein the quality inspection model comprises a first quality inspection model for detecting side defects of the equipment enclosure and a second quality inspection model for detecting large surface defects of the equipment enclosure; the equipment image comprises a first image obtained by shooting the side surface of the equipment shell and a second image obtained by shooting the large surface of the equipment shell; the defect class is associated with a class of the device and includes a plurality of substantial defect classes and a plurality of non-substantial defect classes.
13. A knowledge distillation based model training apparatus, comprising:
a feature acquisition module, configured to acquire a first target feature extracted from an input image based on a teacher model and a second target feature extracted from the input image based on a student model, wherein the first target feature and the second target feature comprise content features for expressing content and dense features for predicting categories;
the weight determining module is used for determining a characteristic weight matrix corresponding to the content characteristic of the first target characteristic and the dense characteristic of the first target characteristic respectively through an attention mechanism;
a feature loss determination module, configured to determine respective corresponding feature loss values for the content feature and the dense feature by performing the following operations: calculating a loss value between the extracted features of the teacher model and the student model, and determining a feature loss value based on the corresponding feature weight matrix and the loss value, including: performing convolution processing on the content features of the second target features to enable the dimensionality of the content features of the first target features to be consistent with that of the content features of the second target features, calculating loss values between the content features of the first target features and the content features of the convolved second target features element by element, and multiplying the loss values by a feature weight matrix corresponding to the content features to obtain first feature loss values; performing convolution processing on the dense features of the second target features to enable the dimensions of the dense features of the first target features and the dense features of the second target features to be consistent, calculating loss values between the dense features of the first target features and the dense features of the convolved second target features element by element, and multiplying the loss values by a feature weight matrix corresponding to the dense features to obtain second feature loss values;
the category loss determining module is used for calculating a category loss value between a predicted category indicated by the dense features of the second target feature and a real category corresponding to the input image;
and the parameter adjustment module is used for adjusting the network parameters of the student model based on the category loss value, the first characteristic loss value and the second characteristic loss value.
14. A quality control apparatus, comprising:
the image recognition module is used for recognizing the equipment image through the pre-trained quality inspection model and determining whether defects exist in the equipment image and the defect category corresponding to the defects;
wherein the quality inspection model is a student model trained by the knowledge distillation-based model training method of any one of claims 1-10.
15. An electronic device comprising a memory, a processor and a computer program stored on the memory, characterized in that the processor executes the computer program to carry out the steps of the method of any one of claims 1-12.
16. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1-12.
CN202311071587.0A 2023-08-24 2023-08-24 Model training method, quality inspection method and related equipment based on knowledge distillation Active CN116805387B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311071587.0A CN116805387B (en) 2023-08-24 2023-08-24 Model training method, quality inspection method and related equipment based on knowledge distillation

Publications (2)

Publication Number Publication Date
CN116805387A CN116805387A (en) 2023-09-26
CN116805387B true CN116805387B (en) 2023-11-21

Family

ID=88079726

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311071587.0A Active CN116805387B (en) 2023-08-24 2023-08-24 Model training method, quality inspection method and related equipment based on knowledge distillation

Country Status (1)

Country Link
CN (1) CN116805387B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117115469B * 2023-10-23 2024-01-05 Tencent Technology Shenzhen Co Ltd Training method, device, storage medium and equipment for image feature extraction network

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113743514A (en) * 2021-09-08 2021-12-03 庆阳瑞华能源有限公司 Knowledge distillation-based target detection method and target detection terminal
CN114663670A (en) * 2022-03-25 2022-06-24 腾讯科技(上海)有限公司 Image detection method and device, electronic equipment and storage medium
CN114821271A (en) * 2022-05-19 2022-07-29 平安科技(深圳)有限公司 Model training method, image description generation device and storage medium
CN115049878A (en) * 2022-06-17 2022-09-13 平安科技(深圳)有限公司 Target detection optimization method, device, equipment and medium based on artificial intelligence
CN116596916A (en) * 2023-06-09 2023-08-15 北京百度网讯科技有限公司 Training of defect detection model and defect detection method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Densely Distilled Flow-Based Knowledge Transfer in Teacher-Student Framework for Image Classification;Ji-Hoon Bae等;《IEEE Transactions on Image Processing》;第29卷;第5698 - 5710页 *
Research on apparent crack detection methods for overflow dams of hydropower stations based on deep learning; Feng Chuncheng; China Doctoral Dissertations Full-text Database, Engineering Science and Technology II (No. 09); p. C037-1 *

Also Published As

Publication number Publication date
CN116805387A (en) 2023-09-26


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant