CN115170919B - Image processing model training and image processing method, device, equipment and storage medium - Google Patents


Info

Publication number
CN115170919B
CN115170919B
Authority
CN
China
Prior art keywords
probability distribution
image
model
student model
loss function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210759709.4A
Other languages
Chinese (zh)
Other versions
CN115170919A
Inventor
杨馥魁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210759709.4A
Publication of CN115170919A
Application granted
Publication of CN115170919B


Classifications

    • G06V 10/82: Image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06N 3/082: Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G06V 10/764: Image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides an image processing model training method, an image processing method, and corresponding apparatuses, devices and storage media, relating to the field of artificial intelligence, in particular to deep learning, image processing and computer vision. The image processing model training method comprises the following steps: converting first image features output by a teacher model into a first probability distribution; converting second image features output by a student model into a second probability distribution; constructing a loss function based on the prior probability distribution of the student model together with the first probability distribution and the second probability distribution; and adjusting model parameters of the student model based on the loss function. The present disclosure can improve the accuracy of the trained student model.

Description

Image processing model training and image processing method, device, equipment and storage medium
Technical Field
The disclosure relates to the field of artificial intelligence, in particular to deep learning, image processing and computer vision, and more particularly to an image processing model training method, an image processing method, and corresponding apparatuses, devices and storage media.
Background
Knowledge distillation is a common model compression method. Unlike pruning and quantization, which are other model compression techniques, knowledge distillation trains a small, lightweight model using the supervision information of a larger model with better performance, so as to achieve better performance and accuracy. The large model is called the teacher model, and the small model is called the student model. The process of learning the supervision information from the teacher model is called distillation.
Disclosure of Invention
The present disclosure provides an image processing model training and image processing method, apparatus, device and storage medium.
According to an aspect of the present disclosure, there is provided an image processing model training method including: converting the first image features output by the teacher model into first probability distribution; converting the second image features output by the student model into second probability distribution; constructing a loss function based on the prior probability distribution of the student model, and the first probability distribution and the second probability distribution; and adjusting model parameters of the student model based on the loss function.
According to another aspect of the present disclosure, there is provided an image processing method including: acquiring an image to be processed; extracting image features of the image to be processed by adopting an image feature extraction model; and acquiring an image processing result of the image to be processed based on the image features; wherein the image feature extraction model is a student model trained using the method of any of the above aspects.
According to another aspect of the present disclosure, there is provided an image processing model training apparatus including: the first conversion module is used for converting the first image characteristics output by the teacher model into first probability distribution; the second conversion module is used for converting second image features output by the student model into second probability distribution; the construction module is used for constructing a loss function based on the prior probability distribution of the student model, the first probability distribution and the second probability distribution; and the adjusting module is used for adjusting the model parameters of the student model based on the loss function.
According to another aspect of the present disclosure, there is provided an image processing apparatus including: an acquisition module for acquiring an image to be processed; an extraction module for extracting image features of the image to be processed by adopting an image feature extraction model; and a determining module for acquiring an image processing result of the image to be processed based on the image features; wherein the image feature extraction model is a student model trained using the method of any of the above aspects.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the above aspects.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method according to any one of the above aspects.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method according to any of the above aspects.
According to the technical scheme, the accuracy of the trained student model can be improved.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow chart of an image processing model training method provided by an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of an application scenario for implementing an image processing model training method or an image processing method of an embodiment of the present disclosure;
FIG. 3 is a diagram of an image processing model training architecture provided by an embodiment of the present disclosure;
FIG. 4 is a flow chart of another image processing model training method provided by an embodiment of the present disclosure;
FIG. 5 is a flow chart of an image processing method provided by an embodiment of the present disclosure;
FIG. 6 is a block diagram of an image processing model training apparatus provided by an embodiment of the present disclosure;
fig. 7 is a block diagram of an image processing apparatus provided in an embodiment of the present disclosure;
fig. 8 is a schematic diagram of an electronic device for implementing an image processing model training method or an image processing method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the related art, knowledge distillation directly supervises the image features output by the student model: for example, an L2 loss function is constructed from the image features output by the teacher model and those output by the student model, and the model parameters of the student model are adjusted based on the L2 loss function, so that the image features output by the student model are as close as possible to those output by the teacher model.
However, when the difference between the structure of the teacher model and the structure of the student model is large, the manner of directly supervising the image features output by the student model may result in poor accuracy of the trained student model.
In order to improve model accuracy, the present disclosure provides the following embodiments.
Fig. 1 is a flowchart of an image processing model training method according to an embodiment of the disclosure. As shown in Fig. 1, the method includes:
101. The first image features output by the teacher model are converted into a first probability distribution.
102. The second image features output by the student model are converted into a second probability distribution.
103. A loss function is constructed based on the prior probability distribution of the student model, the first probability distribution and the second probability distribution.
104. Model parameters of the student model are adjusted based on the loss function.
The teacher model and the student model are both deep neural network models; the structure of the teacher model is more complex than that of the student model. Through the knowledge distillation process, the student model, as the small model, can learn knowledge from the teacher model, as the large model, thereby improving the performance of the student model.
For the image processing field, the teacher model and the student model may be referred to as image processing models. Further, since the models are used for extracting image features, they may also be referred to as image feature extraction models. In terms of structure, both may be convolutional neural network (Convolutional Neural Network, CNN) models: the teacher model may be a larger-scale CNN model, and the student model a smaller-scale CNN model. More specifically, the teacher model is, for example, a ResNet model, and the student model is, for example, a MobileNet model.
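For illustration, a minimal sketch of how such a teacher-student pair could be instantiated (the concrete torchvision variants below are assumptions; the text only names ResNet and MobileNet as examples):

import torchvision.models as models

# Sketch only: resnet50 and mobilenet_v2 are assumed stand-ins for the
# ResNet / MobileNet examples named above.
teacher = models.resnet50(weights=None)      # larger-scale CNN as the teacher
student = models.mobilenet_v2(weights=None)  # smaller-scale CNN as the student
teacher.eval()  # the teacher is already trained and is not updated during distillation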
In the field of image processing, the inputs of the teacher model and the student model are images, and the outputs are image features. For distinction, the image feature output by the teacher model is referred to as the first image feature, and the image feature output by the student model is referred to as the second image feature.
In this embodiment, the first image feature and the second image feature may be converted into a first probability distribution and a second probability distribution, respectively, so as to construct a loss function based on the first probability distribution and the second probability distribution.
Wherein a normalization function may be employed to convert the image features into probability distributions.
The normalization function is for example a softmax function.
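As a minimal sketch of this conversion (assuming (N, C)-shaped feature tensors, consistent with the dimension discussion later in this description):

import torch
import torch.nn.functional as F

def to_distribution(features: torch.Tensor) -> torch.Tensor:
    # features: (N, C) image features; softmax normalizes each row into
    # a probability distribution over the C dimensions.
    return F.softmax(features, dim=1)

# t_logit = to_distribution(first_image_features)   # first probability distribution
# s_logit = to_distribution(second_image_features)  # second probability distribution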
There is no ordering constraint between steps 101 and 102.
When constructing the loss function, the prior probability distribution of the student model can be determined, and the loss function is constructed based on the prior probability distribution, the first probability distribution and the second probability distribution.
The prior probability distribution is a probability distribution that does not depend on observation data and can express the randomness of the student model; taking the prior probability distribution of the student model into account can improve the flexibility and accuracy of the trained student model.
After obtaining the loss function, a general model parameter update algorithm, such as a Back Propagation (BP) algorithm, may be used to adjust model parameters of the student model.
In this embodiment, the image features are converted into probability distributions and the loss function is constructed based on the probability distributions, instead of being constructed directly on the image features. Because probability distributions are adopted, the student model can learn more knowledge, which improves the accuracy of the student model. In addition, the prior probability distribution of the student model is taken into account when constructing the loss function, which improves flexibility and further improves the accuracy of the student model.
In order to better understand the embodiments of the present disclosure, application scenarios to which the embodiments of the present disclosure are applicable are described below. The present embodiment takes the image processing field as an example. The image processing includes, for example: face recognition, target detection, target classification, etc.
Taking face recognition as an example, the final student model may be used for face recognition. For example, as shown in Fig. 2, a user may install an application (APP) capable of performing face recognition on a mobile device (such as a mobile phone). The APP may collect a face image through a face collecting device (such as a camera) on the mobile device. If the mobile device itself has face recognition capability, for example if the APP locally deploys a student model for face recognition on the mobile device, the student model may be used to perform face recognition on the collected face image locally on the mobile device.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other processing of the user's personal information involved comply with the provisions of relevant laws and regulations, and do not violate public order and good morals.
It can be understood that the above takes face recognition on the mobile device side as an example; alternatively, the student model may be deployed on the server side, in which case the APP sends the collected face image to the server, and the server performs face recognition on the received face image based on the deployed student model for face recognition.
The left side of fig. 2 shows the application of a student model, for example, face recognition using the student model.
In order to apply the student model, the student model needs to be acquired first, and the student model can be obtained based on knowledge distillation training.
Wherein, as shown on the right side of fig. 2, the training process may be performed by the server 202, i.e., training of the student model is completed in the server, and the server may send the trained student model to the mobile device, so that the student model is adopted locally at the mobile device for face recognition.
The knowledge distillation architecture can comprise a teacher model and a student model, wherein the teacher model is a trained model with larger scale, and the student model is a model to be trained with smaller scale.
Taking face recognition as an example, the teacher model can perform feature extraction processing on an image sample to output the first image features, and the student model can likewise perform feature extraction processing on an image sample to output the second image features. The image features may be embodied as feature maps.
The image sample may be from an existing sample set, such as ImageNet.
In addition, the image sample corresponding to the teacher model may be referred to as a first image, the image sample corresponding to the student model may be referred to as a second image, and the first image and the second image may be derived from the same image, for example, obtained by performing different data enhancement processing methods on the same image.
In this embodiment, the first image and the second image come from the same image. Combined with the subsequent mutual information loss function, the characteristic that homologous data has maximal mutual information can be exploited, so that minimizing the mutual information loss function improves the performance of the student model.
Because the number of samples in the sample set is limited, in order to obtain more samples, in this embodiment, different data enhancement processes may be performed on the same image sample to obtain a first image and a second image, where the first image is used as an input of a teacher model, and the second image is used as an input of a student model.
As shown in fig. 3, the same image sample may be referred to as an original image, and a first data enhancement process may be performed on the original image to obtain a first image, and a second data enhancement process may be performed on the original image to obtain a second image, where the first data enhancement process is different from the second data enhancement process.
The data enhancement processing includes, for example: clipping, rotation, occlusion, size transformation, modifying brightness, etc.
The first data enhancement process and the second data enhancement process may be the same type of process, e.g., both rotated, or both modified brightness.
In addition, the processing intensity of the second data enhancement processing is greater than that of the first data enhancement processing. Taking rotation as an example, the rotation angle of the second data enhancement process is larger than the rotation angle of the first data enhancement process.
Because the first data enhancement processing corresponds to the teacher model and the second to the student model, and the performance of the teacher model is stronger than that of the student model, the teacher model can transfer sufficient knowledge even under a lower-intensity processing mode, while the student model, under a higher-intensity processing mode, can learn more knowledge, which improves the accuracy of the student model.
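A hedged sketch of such a pair of enhancement pipelines; the specific transforms and magnitudes below are assumptions for illustration, and only the weaker/stronger relationship comes from the text:

import torchvision.transforms as T

first_enhance = T.Compose([T.RandomRotation(degrees=10), T.ToTensor()])   # weaker, feeds the teacher
second_enhance = T.Compose([T.RandomRotation(degrees=45), T.ToTensor()])  # stronger, feeds the student

# first_image = first_enhance(original_image)    # input to the teacher model
# second_image = second_enhance(original_image)  # input to the student model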
The first image may be input into the teacher model, which outputs the first image features, and the second image may be input into the student model, which outputs the second image features. The image features may then be converted into probability distributions using a softmax function. As shown in FIG. 3, the first probability distribution corresponding to the teacher model is denoted t_logit, and the second probability distribution corresponding to the student model is denoted s_logit.
Wherein a joint probability distribution may be constructed based on the first probability distribution (t_logit) and the second probability distribution (s_logit).
In addition, a Gaussian distribution conforming to N(0, 1) can be randomly obtained, and based on it a prior Gaussian distribution conforming to N(m, d) can be constructed, where m and d are learnable parameters.
Thereafter, a Gaussian-prior-based mutual information loss function can be constructed based on the joint probability distribution, the prior Gaussian distribution conforming to N(m, d), and the posterior probability distribution (i.e., s_logit).
After the loss function is obtained, the BP algorithm can be adopted to update the model parameters of the student model and the learnable parameters m and d, until a preset number of iterations is reached, thereby obtaining the final student model.
In conjunction with the architecture shown in fig. 3, the present disclosure also provides a model training method.
Fig. 4 is a flowchart of another image processing model training method according to an embodiment of the present disclosure, where the method provided by the present embodiment includes:
401. The original image is subjected to a first data enhancement process to obtain a first image.
402. The teacher model is adopted to perform feature extraction processing on the input first image so as to output first image features.
403. The first image features are converted into a first probability distribution.
404. The original image is subjected to a second data enhancement process to obtain a second image.
405. The student model is adopted to perform feature extraction processing on the input second image so as to output second image features.
406. The second image features are converted into a second probability distribution.
Wherein the original image may be an image obtained from an existing sample set.
The first data enhancement process is different from the second data enhancement process, so that two different images, namely the first image and the second image, can be obtained from the same original image.
The first data enhancement process and the second data enhancement process may be the same class of data enhancement processes, e.g., both rotation operations.
In addition, the processing intensity of the second data enhancement processing may be greater than that of the first data enhancement processing, for example, the rotation angle of the second data enhancement processing is greater than that of the first data enhancement processing.
Because the first data enhancement processing corresponds to the teacher model and the second to the student model, and the performance of the teacher model is stronger than that of the student model, the teacher model can transfer sufficient knowledge even under a lower-intensity processing mode, while the student model, under a higher-intensity processing mode, can learn more knowledge, which improves the accuracy of the student model.
In this embodiment, different data enhancement processing is performed on the original image to obtain the first image and the second image, so that both images can be obtained on the basis of a smaller sample size; this improves the accuracy of the student model and can also improve its robustness.
The teacher model and the student model may both be deep neural network models for extracting image features, the teacher model being a trained model, and the student model being a model to be trained.
The teacher model is, for example, a ResNet model and the student model is, for example, a MobileNet model.
The input of the teacher model and the student model is an image, and the output is an image feature.
The image features may be converted to probability distributions using a softmax function.
There is no ordering constraint between steps 401-403 and steps 404-406.
407. A joint probability distribution is constructed based on the first probability distribution and the second probability distribution.
The loss function employed in this embodiment is a mutual information loss function expressed as:
Mutual_loss = p(s,t) * (log(p(s)) + log(p(t)) - log(p(s,t)))    (1)
where Mutual_loss is the mutual information loss function, p(s,t) is the joint probability distribution of the teacher model and the student model, p(s) is the prior probability distribution of the student model, and p(t) is the prior probability distribution of the teacher model.
Substituting the Bayesian formula p(s,t) = p(t) * p(s|t) into formula (1) yields:
Mutual_loss = p(s,t) * (log(p(s)) - log(p(s|t)))    (2)
where p(s|t) is the posterior probability distribution of the student model, i.e., the second probability distribution; in combination with FIG. 3, p(s|t) = s_logit.
Thus, a loss function as shown in formula (2) can be constructed.
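Spelling out the substitution step, formula (2) follows from formula (1) as:

Mutual_loss = p(s,t) * (log(p(s)) + log(p(t)) - log(p(s,t)))
            = p(s,t) * (log(p(s)) + log(p(t)) - log(p(t)) - log(p(s|t)))    (using p(s,t) = p(t) * p(s|t))
            = p(s,t) * (log(p(s)) - log(p(s|t)))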
Since the loss function is related to the joint probability distribution of the teacher model and the student model, the prior probability distribution of the student model, and the posterior probability distribution of the student model, the joint probability distribution needs to be calculated.
The first probability distribution may be represented by t_logit and the second probability distribution by s_logit, where the dimensions of t_logit and s_logit are (N, C): N is the number of first images or second images, and C is the feature dimension of a single image, such as the number of categories.
The calculation formula of the joint probability distribution may be:
p(s,t) = t_logit^T * s_logit
where the superscript T denotes a transpose operation.
Thus, the dimension of p (s, t) is (C, C).
Further, to obtain a more accurate joint probability distribution, the calculation formula of the joint probability distribution may be:
p1=(s_logit.unsqueeze(2)*t_logit.unsqueeze(1)).sum(dim=0)
p2=(p1+p1.t())/2;
p(s,t)=p2/p2.sum(dim=0);
where p1 is the similarity of t_logit and s_logit; unsqueeze() is a dimension-expansion operation, the dimension of s_logit.unsqueeze(2) is (N, C, 1), the dimension of t_logit.unsqueeze(1) is (N, 1, C), and sum(dim=0) sums over the first dimension, so the dimension of p1 is (C, C). p2 averages p1 with its transpose p1.t(), so the dimension of p2 is also (C, C). p(s,t) is the normalization of p2, and its dimension is also (C, C).
Therefore, the joint probability distribution p(s,t) can also be calculated based on the above calculation formulas for p1, p2 and p(s,t).
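The three formulas above translate directly into PyTorch; the following sketch is a literal transcription (the function name is illustrative):

import torch

def joint_probability(s_logit: torch.Tensor, t_logit: torch.Tensor) -> torch.Tensor:
    # s_logit, t_logit: (N, C) probability distributions of student and teacher.
    p1 = (s_logit.unsqueeze(2) * t_logit.unsqueeze(1)).sum(dim=0)  # (C, C) similarity
    p2 = (p1 + p1.t()) / 2                                         # symmetrize p1
    return p2 / p2.sum(dim=0)                                      # normalize to obtain p(s, t)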
408. An initial probability distribution of the student model is determined.
Wherein the student model may be randomly initialized with a gaussian distribution to obtain an initial probability distribution for the student model.
For example, as shown in FIG. 3, the initial probability distribution is a randomly initialized Gaussian distribution function conforming to the N(0, 1) distribution.
It will be appreciated that the initial probability distribution may also be assumed to follow another distribution function, such as a Laplace distribution.
As the Gaussian distribution is a relatively common distribution function that can basically cover all variable distribution situations, using a Gaussian distribution to determine the initial probability distribution of the student model can match the actual prior distribution of the student model and improve its accuracy.
409. A prior probability distribution of the student model is determined based on the learnable distribution parameters and the initial probability distribution.
The calculation formula of the prior probability distribution of the student model may be:
p(s)=m+d*a;
where p(s) is the prior probability distribution of the student model; m and d are learnable distribution parameters; and a is the initial probability distribution, for example a Gaussian distribution function conforming to the N(0, 1) distribution.
It will be appreciated that the prior probability distribution described above may be determined based on one or more sets of learnable distribution parameters.
For example, p(s) = Σ_i (m_i + d_i * a)
where (m_i, d_i) is the i-th (i = 1, 2, 3, ..., N) group of distribution parameters, and N is the number of groups of distribution parameters. The addition of the functions corresponding to multiple groups of distribution parameters may be, for example, a weighted addition.
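A minimal sketch of the single-group learnable prior p(s) = m + d * a; treating m and d as C-dimensional vectors is an assumption, since the text does not fix their shape:

import torch
import torch.nn as nn

class LearnablePrior(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.m = nn.Parameter(torch.zeros(dim))  # learnable distribution parameter m
        self.d = nn.Parameter(torch.ones(dim))   # learnable distribution parameter d

    def forward(self) -> torch.Tensor:
        a = torch.randn_like(self.m)  # initial probability distribution, ~ N(0, 1)
        return self.m + self.d * a    # prior probability distribution p(s)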
In this embodiment, the prior probability distribution of the student model is determined based on the learnable distribution parameters and the initial probability distribution. Because the distribution parameters are learnable, they can be adjusted during training, so that the student model better approximates the optimal solution, which improves the accuracy of the student model. In addition, since the learnable distribution parameters act on the prior process, using them to determine the prior probability distribution improves the flexibility of model training, further improving the accuracy of the student model.
For example, in a scenario of recognizing age from a face image, the recognition result may be that the user is a child, an adult or an elderly person; in general, recognition works better for adults and worse for children and the elderly. Such prior knowledge about the results may be reflected through the learnable prior distribution.
There is no ordering constraint between steps 401-407 and steps 408-409.
410. A Gaussian-prior-based mutual information loss function is constructed based on the joint probability distribution, the prior probability distribution and the posterior probability distribution of the student model.
The posterior probability distribution of the student model is the second probability distribution, i.e., s_logit in FIG. 3 and p(s|t) in the above formulas.
Based on the above formula (2), the loss function may be calculated using the joint probability distribution, the prior probability distribution, and the posterior probability distribution described above.
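A sketch of formula (2) in code; how the (C, C) joint distribution, the prior and the (N, C) posterior are shape-aligned is not spelled out in the text, so the broadcasting and the epsilon guard below are assumptions:

import torch

def mutual_info_loss(p_st: torch.Tensor, p_s: torch.Tensor,
                     p_s_given_t: torch.Tensor) -> torch.Tensor:
    # Formula (2): Mutual_loss = p(s,t) * (log p(s) - log p(s|t)),
    # summed over all entries; the tensors are assumed broadcast-compatible.
    eps = 1e-8  # guards the logarithms (an assumption, not from the text)
    return (p_st * (torch.log(p_s + eps) - torch.log(p_s_given_t + eps))).sum()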
411. Model parameters of the student model, as well as the learnable distribution parameters, are adjusted based on the loss function.
The training process may be divided into a plurality of iterations. In each iteration, a common parameter adjustment algorithm, such as the BP algorithm, may be used to adjust the model parameters and the distribution parameters, until a preset number of iterations is reached; the model parameters at that point are taken as the model parameters of the finally generated student model.
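A hypothetical end-to-end iteration tying the sketches above together (teacher, student, first_enhance, second_enhance, joint_probability, LearnablePrior and mutual_info_loss are the assumed helpers from the earlier sketches; num_classes, preset_iterations and original_image are placeholders):

import torch
import torch.nn.functional as F

prior = LearnablePrior(dim=num_classes)  # num_classes corresponds to C
optimizer = torch.optim.SGD(
    list(student.parameters()) + list(prior.parameters()), lr=0.01)

for step in range(preset_iterations):
    first_image = first_enhance(original_image).unsqueeze(0)    # teacher input
    second_image = second_enhance(original_image).unsqueeze(0)  # student input
    with torch.no_grad():  # the teacher is not updated
        t_logit = F.softmax(teacher(first_image), dim=1)
    s_logit = F.softmax(student(second_image), dim=1)
    loss = mutual_info_loss(joint_probability(s_logit, t_logit), prior(), s_logit)
    optimizer.zero_grad()
    loss.backward()   # BP adjusts the student parameters and the learnable m, d
    optimizer.step()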
In this embodiment, a Bayesian formula is utilized: the joint probability distribution is constructed based on the first probability distribution and the second probability distribution, and the mutual information loss function is constructed based on the joint probability distribution, the prior probability distribution and the posterior probability distribution. The mutual information loss function in this embodiment therefore differs from a general mutual information loss function in that it is determined based on the prior probability distribution.
In this embodiment, the learnable distribution parameters are adjusted based on the loss function, so that the distribution parameters are adjustable in the training process of the student model, and flexibility and accuracy of the student model can be further improved.
The above describes a model training process by which a trained student model can be obtained. In the model application stage, student models can be used for image processing.
Fig. 5 is a flowchart of an image processing method according to an embodiment of the present disclosure. As shown in Fig. 5, the image processing method includes:
501. An image to be processed is acquired.
502. An image feature extraction model is adopted to extract the image features of the image to be processed.
503. An image processing result is obtained based on the image features.
The image feature extraction model may be a student model trained by any of the methods described above.
In the model application stage, the image feature extraction model can be adopted to extract image features, and then the image processing result is obtained based on the image features.
Taking face recognition as an example, the image to be processed may be a face image. Accordingly, the image features may be image features of a face image.
Based on the difference of application scenes, the image features can be input into a model of a related downstream task for processing so as to output an image processing result.
Still taking face recognition as an example, face recognition can be regarded as a classification task; thus, the image features can be input into a classification model whose output is the face recognition result, for example, determining which of a plurality of candidates a face image belongs to, or identifying the age group of the user corresponding to the face image. The specific structure of the classification model may be implemented using various related techniques, such as a fully connected network.
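For illustration, a hypothetical downstream head (feature_dim, num_candidates and image_to_process are placeholders, and a single fully connected layer is an assumption):

import torch.nn as nn

classifier = nn.Linear(feature_dim, num_candidates)  # fully connected classification head

features = student(image_to_process)  # trained student model as feature extractor
logits = classifier(features)
prediction = logits.argmax(dim=1)     # e.g. which candidate, or which age group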
In this embodiment, the image feature extraction model is a student model obtained by the above training method. Because the accuracy of the trained student model is higher, image features with higher accuracy can be obtained, which in turn improves the accuracy of the image processing result.
Fig. 6 is a block diagram of an image processing model training apparatus according to an embodiment of the present disclosure. As shown in Fig. 6, the apparatus 600 includes: a first conversion module 601, a second conversion module 602, a construction module 603 and an adjustment module 604.
The first conversion module 601 is configured to convert a first image feature output by the teacher model into a first probability distribution; the second conversion module 602 is configured to convert a second image feature output by the student model into a second probability distribution; the construction module 603 is configured to construct a loss function based on the prior probability distribution of the student model, and the first probability distribution and the second probability distribution; the adjustment module 604 is configured to adjust model parameters of the student model based on the loss function.
In this embodiment, the image features are converted into probability distributions and the loss function is constructed based on the probability distributions, instead of being constructed directly on the image features. Because probability distributions are adopted, the student model can learn more knowledge, which improves the accuracy of the student model. In addition, the prior probability distribution of the student model is taken into account when constructing the loss function, which improves flexibility and further improves the accuracy of the student model.
In some embodiments, the loss function is a mutual information loss function, and the construction module 603 is further configured to: construct a joint probability distribution based on the first probability distribution and the second probability distribution; determine the prior probability distribution of the student model; take the second probability distribution as the posterior probability distribution of the student model; and construct the mutual information loss function based on the joint probability distribution, the prior probability distribution and the posterior probability distribution.
In this embodiment, a Bayesian formula is utilized: the joint probability distribution is constructed based on the first probability distribution and the second probability distribution, and the mutual information loss function is constructed based on the joint probability distribution, the prior probability distribution and the posterior probability distribution. The mutual information loss function in this embodiment therefore differs from a general mutual information loss function in that it is determined based on the prior probability distribution.
In some embodiments, the loss function is a mutual information loss function, and the building module 603 is further configured to: the building block 603 is further configured to: determining an initial probability distribution of the student model; a priori probability distribution of the student model is determined based on the learnable distribution parameters and the initial probability distribution.
In this embodiment, the prior probability distribution of the student model is determined based on the learnable distribution parameters and the initial probability distribution. Because the distribution parameters are learnable, they can be adjusted during training, so that the student model better approximates the optimal solution, which improves the accuracy of the student model. In addition, since the learnable distribution parameters act on the prior process, using them to determine the prior probability distribution improves the flexibility of model training, further improving the accuracy of the student model.
In some embodiments, the apparatus 600 further comprises: and the learning module is used for adjusting the learnable distribution parameters based on the loss function.
By determining the prior probability distribution based on the learnable distribution parameters and adjusting those parameters based on the loss function, the distribution parameters remain adjustable during the training of the student model, which can further improve the flexibility and accuracy of the student model.
In some embodiments, the construction module 603 is further configured to: randomly initialize the student model with a Gaussian distribution to obtain the initial probability distribution of the student model.
As the Gaussian distribution is a relatively common distribution function that can basically cover all variable distribution situations, using a Gaussian distribution to determine the initial probability distribution of the student model can match the actual prior distribution of the student model and improve its accuracy.
In some embodiments, the apparatus 600 further comprises: the first feature extraction module is used for carrying out feature extraction processing on the input first image by adopting the teacher model so as to output the first image features; and/or a second feature extraction module, configured to perform feature extraction processing on an input second image by using the student model, so as to output the second image feature; wherein the first image and the second image are from the same image.
In this embodiment, the first image and the second image come from the same image. Combined with the subsequent mutual information loss function, the characteristic that homologous data has maximal mutual information can be exploited, so that minimizing the mutual information loss function improves the performance of the student model.
In some embodiments, the apparatus 600 further comprises: the first data enhancement module is used for carrying out first data enhancement processing on the original image so as to obtain the first image; the second data enhancement module is used for carrying out second data enhancement processing on the original image so as to obtain the second image; wherein the first data enhancement process is different from the second data enhancement process.
In this embodiment, different data enhancement processing is performed on the original image to obtain the first image and the second image, so that both images can be obtained on the basis of a smaller sample size; this improves the accuracy of the student model and can also improve its robustness.
Fig. 7 is a block diagram of an image processing apparatus according to an embodiment of the present disclosure, and as shown in fig. 7, the apparatus 700 includes: an acquisition module 701, an extraction module 702 and a determination module 703.
The acquisition module 701 is used for acquiring an image to be processed; the extraction module 702 is configured to extract image features of the image to be processed by using an image feature extraction model; the determining module 703 is configured to obtain an image processing result of the image to be processed based on the image feature.
The image feature extraction model is a student model trained by the training method according to any one of the above.
In this embodiment, the image feature extraction model is a student model obtained by the above training method. Because the accuracy of the trained student model is higher, image features with higher accuracy can be obtained, which in turn improves the accuracy of the image processing result.
It can be understood that "first", "second", etc. in the embodiments of the present disclosure are only used for distinguishing, and do not indicate the importance level, the time sequence, etc.
It is to be understood that in the embodiments of the disclosure, the same or similar content in different embodiments may be referred to each other.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other processing of the user's personal information involved comply with the provisions of relevant laws and regulations, and do not violate public order and good morals.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 8 illustrates a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile apparatuses, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing apparatuses. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the electronic device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the electronic device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in electronic device 800 are connected to I/O interface 805, including: an input unit 806 such as a keyboard, mouse, etc.; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, etc.; and a communication unit 809, such as a network card, modem, wireless communication transceiver, or the like. The communication unit 809 allows the electronic device 800 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the respective methods and processes described above, such as the model training method or the image processing method. For example, in some embodiments, the model training method or the image processing method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the model training method or the image processing method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the model training method or the image processing method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs, which may be executed and/or interpreted on a programmable system including at least one programmable processor; the programmable processor may be a special purpose or general-purpose programmable processor that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system that overcomes the defects of difficult management and weak service scalability in traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system or a server that incorporates a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (18)

1. An image processing model training method, comprising:
converting the first image features output by the teacher model into first probability distribution;
converting the second image features output by the student model into second probability distribution;
constructing a joint probability distribution based on the first probability distribution and the second probability distribution;
constructing a loss function based on the joint probability distribution, the prior probability distribution of the student model and the posterior probability distribution of the student model, wherein the loss function is a mutual information loss function; and adjusting model parameters of the student model based on the loss function.
2. The method of claim 1, wherein the constructing a loss function based on the joint probability distribution, the prior probability distribution of the student model, and the posterior probability distribution of the student model comprises:
determining the prior probability distribution of the student model;
taking the second probability distribution as a posterior probability distribution of the student model;
and constructing the mutual information loss function based on the joint probability distribution, the prior probability distribution and the posterior probability distribution.
3. The method of claim 2, wherein the determining the prior probability distribution of the student model comprises:
determining an initial probability distribution of the student model;
and determining the prior probability distribution of the student model based on learnable distribution parameters and the initial probability distribution.
4. A method according to claim 3, further comprising:
based on the loss function, the learnable distribution parameters are adjusted.
5. A method according to claim 3, wherein said determining an initial probability distribution of the student model comprises:
randomly initializing the student model by adopting a Gaussian distribution to obtain the initial probability distribution of the student model.
6. The method of any of claims 1-5, further comprising:
performing feature extraction processing on the input first image by adopting the teacher model so as to output the first image features; and/or,
adopting the student model to perform feature extraction processing on the input second image so as to output the second image features;
wherein the first image and the second image are from the same image.
7. The method of claim 6, further comprising:
performing first data enhancement processing on an original image to obtain the first image;
performing second data enhancement processing on the original image to obtain the second image;
wherein the first data enhancement process is different from the second data enhancement process.
8. An image processing method, comprising:
acquiring an image to be processed;
extracting image features of the image to be processed using an image feature extraction model;
obtaining an image processing result of the image to be processed based on the image features;
wherein the image feature extraction model is a student model trained using the method of any one of claims 1-7.
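At inference time only the distilled student is retained as the feature extractor. A minimal sketch, assuming a hypothetical task head (e.g., a classifier) as the downstream processing, which the claim leaves open:

```python
import torch

@torch.no_grad()
def process_image(student, head, image_tensor):
    # Extract image features with the trained student model, then map them
    # to an image processing result; `head` is a hypothetical task module.
    student.eval()
    features = student(image_tensor.unsqueeze(0))  # add batch dimension
    return head(features).argmax(dim=-1)           # e.g., a predicted class id
```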
9. An image processing model training apparatus comprising:
a first conversion module configured to convert first image features output by a teacher model into a first probability distribution;
a second conversion module configured to convert second image features output by a student model into a second probability distribution;
a construction module configured to construct a loss function based on a prior probability distribution of the student model, the first probability distribution, and the second probability distribution; and
an adjusting module configured to adjust model parameters of the student model based on the loss function;
wherein the loss function is a mutual information loss function, and the construction module is further configured to:
construct a joint probability distribution based on the first probability distribution and the second probability distribution; and
construct the mutual information loss function based on the joint probability distribution, the prior probability distribution of the student model, and a posterior probability distribution of the student model.
10. The apparatus of claim 9, wherein the construction module is further configured to:
determine the prior probability distribution of the student model;
take the second probability distribution as the posterior probability distribution of the student model; and
construct the mutual information loss function based on the joint probability distribution, the prior probability distribution, and the posterior probability distribution.
11. The apparatus of claim 10, wherein the construction module is further configured to:
determine an initial probability distribution of the student model; and
determine the prior probability distribution of the student model based on learnable distribution parameters and the initial probability distribution.
12. The apparatus of claim 11, further comprising:
a learning module configured to adjust the learnable distribution parameters based on the loss function.
13. The apparatus of claim 11, wherein the construction module is further configured to:
randomly initialize the student model with a Gaussian distribution to obtain the initial probability distribution of the student model.
14. The apparatus of any of claims 9-13, further comprising:
the first feature extraction module is used for carrying out feature extraction processing on the input first image by adopting the teacher model so as to output the first image features; and/or the number of the groups of groups,
the second feature extraction module is used for carrying out feature extraction processing on the input second image by adopting the student model so as to output the second image features;
wherein the first image and the second image are from the same image.
15. The apparatus of claim 14, further comprising:
the first data enhancement module is used for carrying out first data enhancement processing on the original image so as to obtain the first image;
the second data enhancement module is used for carrying out second data enhancement processing on the original image so as to obtain the second image;
wherein the first data enhancement process is different from the second data enhancement process.
16. An image processing apparatus comprising:
the acquisition module is used for acquiring the image to be processed;
the extraction module is used for extracting image features of the image to be processed by adopting an image feature extraction model;
the determining module is used for acquiring an image processing result of the image to be processed based on the image characteristics;
wherein the image feature extraction model is a student model trained using the method of any one of claims 1-7.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-8.
CN202210759709.4A 2022-06-29 2022-06-29 Image processing model training and image processing method, device, equipment and storage medium Active CN115170919B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210759709.4A CN115170919B (en) 2022-06-29 2022-06-29 Image processing model training and image processing method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN115170919A (en) 2022-10-11
CN115170919B (en) 2023-09-12

Family

ID=83489528

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210759709.4A Active CN115170919B (en) 2022-06-29 2022-06-29 Image processing model training and image processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115170919B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115578614B (en) * 2022-10-21 2024-03-12 北京百度网讯科技有限公司 Training method of image processing model, image processing method and device

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111105008A (en) * 2018-10-29 2020-05-05 富士通株式会社 Model training method, data recognition method and data recognition device
CN111639710A (en) * 2020-05-29 2020-09-08 北京百度网讯科技有限公司 Image recognition model training method, device, equipment and storage medium
CN112115469A (en) * 2020-09-15 2020-12-22 浙江科技学院 Edge intelligent moving target defense method based on Bayes-Stackelberg game
CN112967088A (en) * 2021-03-03 2021-06-15 上海数鸣人工智能科技有限公司 Marketing activity prediction model structure and prediction method based on knowledge distillation
CN114049541A (en) * 2021-08-27 2022-02-15 之江实验室 Visual scene recognition method based on structural information characteristic decoupling and knowledge migration
CN114373090A (en) * 2020-10-14 2022-04-19 中国移动通信有限公司研究院 Model lightweight method, device, electronic equipment and computer readable storage medium
CN114627331A (en) * 2022-03-07 2022-06-14 北京沃东天骏信息技术有限公司 Model training method and device
CN114647760A (en) * 2022-01-13 2022-06-21 中国矿业大学 Intelligent video image retrieval method based on neural network self-temperature cause and knowledge conduction mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Key Technologies for Semantic Parsing of Driving Behavior in Intelligent Vehicle Decision-Making; Li Guofa; Chen Yaoyu; Lyu Chen; Tao Da; Cao Dongpu; Cheng Bo; Journal of Automotive Safety and Energy (Issue 04); 5-26 *

Also Published As

Publication number Publication date
CN115170919A (en) 2022-10-11

Similar Documents

Publication Publication Date Title
WO2019127924A1 (en) Sample weight allocation method, model training method, electronic device, and storage medium
CN114020950B (en) Training method, device, equipment and storage medium for image retrieval model
CN113642431A (en) Training method and device of target detection model, electronic equipment and storage medium
CN114677565B (en) Training method and image processing method and device for feature extraction network
CN113627361B (en) Training method and device for face recognition model and computer program product
CN112580733B (en) Classification model training method, device, equipment and storage medium
CN112949433B (en) Method, device and equipment for generating video classification model and storage medium
CN113361363A (en) Training method, device and equipment for face image recognition model and storage medium
CN114065863A (en) Method, device and system for federal learning, electronic equipment and storage medium
CN115147680B (en) Pre-training method, device and equipment for target detection model
CN115170919B (en) Image processing model training and image processing method, device, equipment and storage medium
CN115359308B (en) Model training method, device, equipment, storage medium and program for identifying difficult cases
CN114511743B (en) Detection model training, target detection method, device, equipment, medium and product
CN116580223A (en) Data processing and model fine tuning method and device, electronic equipment and storage medium
CN114549904A (en) Visual processing and model training method, apparatus, storage medium, and program product
CN114120454A (en) Training method and device of living body detection model, electronic equipment and storage medium
CN113657468A (en) Pre-training model generation method and device, electronic equipment and storage medium
CN113657248A (en) Training method and device for face recognition model and computer program product
CN116402914B (en) Method, device and product for determining stylized image generation model
CN115937993B (en) Living body detection model training method, living body detection device and electronic equipment
CN117351299A (en) Image generation and model training method, device, equipment and storage medium
CN115294396B (en) Backbone network training method and image classification method
CN114758130B (en) Image processing and model training method, device, equipment and storage medium
CN115292467B (en) Information processing and model training method, device, equipment, medium and program product
CN114842541A (en) Model training and face recognition method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant