CN116361658A - Model training method, task processing method, device, electronic equipment and medium - Google Patents

Model training method, task processing method, device, electronic equipment and medium

Info

Publication number
CN116361658A
CN116361658A (application CN202310370283.8A)
Authority
CN
China
Prior art keywords
student model
similarity
model
target
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310370283.8A
Other languages
Chinese (zh)
Inventor
付琰
陈亮辉
黎瑛
范斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority application: CN202310370283.8A
Publication: CN116361658A
Legal status: Pending

Classifications

    • G PHYSICS → G06 COMPUTING; CALCULATING OR COUNTING → G06F ELECTRIC DIGITAL DATA PROCESSING → G06F18/00 Pattern recognition
    • G06F18/20 Analysing → G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation → G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/20 Analysing → G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation → G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/40 Software arrangements specially adapted for pattern recognition, e.g. user interfaces or toolboxes therefor → G06F18/41 Interactive pattern learning with a human teacher


Abstract

The disclosure provides a model training method, a task processing method, a device, electronic equipment, and a medium. It relates to the field of artificial intelligence, in particular to image recognition and face recognition technology, and can be applied to smart-city, city-management, and emergency-management scenarios. The method comprises the following steps: extracting features from at least one first sample using a teacher model and an initial student model to obtain first features and second features for each first sample; adjusting parameters of the initial student model based on the difference between the first and second features of each first sample to obtain a transitional student model; extracting features from at least one sample pair using the teacher model and the transitional student model to obtain third features and fourth features for each second sample in each sample pair; determining a first similarity between the third features, and a second similarity between the fourth features, of the two second samples in each pair; and adjusting parameters of the transitional student model based on the corresponding first and second similarities to obtain the target student model.

Description

Model training method, task processing method, device, electronic equipment and medium
Technical Field
The disclosure relates to the field of artificial intelligence, in particular to an image recognition and face recognition technology, which can be applied to smart cities, urban management and emergency management scenes.
Background
Large-scale models cannot be deployed on equipment with limited computational resources, so the models need to be compressed in some way; in the related art, model compression is typically achieved through knowledge distillation, quantization, pruning, and other techniques.
Knowledge distillation is a common model training technique whose aim is to distill the knowledge of a teacher model with more parameters into a student model with fewer parameters; the student model is then the one deployed in actual use, reducing the computing resources required. However, when the network structures of the teacher model and the student model differ greatly, the training process may be inefficient or the accuracy of the resulting student model may be poor.
Disclosure of Invention
The disclosure provides a model training method, a task processing method, a device, electronic equipment and a medium.
According to a first aspect of the present disclosure, there is provided a model training method comprising:
extracting features of at least one first sample by using a teacher model and an initial student model to obtain first features and second features of each first sample, wherein the first features are extracted by the teacher model, and the second features are extracted by the initial student model;
adjusting parameters of the initial student model based on differences of the first features and the second features of each first sample to obtain a transitional student model;
performing feature extraction on at least one sample pair by using a teacher model and a transition student model to obtain third features and fourth features of a second sample in each sample pair, wherein the sample pair comprises two second samples, the third features are extracted by the teacher model, and the fourth features are extracted by the transition student model;
determining a first similarity of a third feature of the two second samples in each sample pair and a second similarity of a fourth feature of the two second samples in each sample pair;
and adjusting parameters of the transition student model based on the first similarity and the second similarity corresponding to each sample pair to obtain a target student model.
In some embodiments of the present disclosure, the third feature is extracted by a feature extraction layer in the teacher model, and the fourth feature is extracted by a feature extraction layer in the transitional student model;
adjusting parameters of the transitional student model based on the first similarity and the second similarity corresponding to each sample pair to obtain a target student model, including: and adjusting parameters of a feature extraction layer of the transitional student model based on the first similarity and the second similarity corresponding to each sample pair to obtain a target student model.
In some embodiments of the present disclosure, adjusting parameters of a feature extraction layer of a transitional student model based on a first similarity and a second similarity corresponding to each sample pair to obtain a target student model includes:
determining a second loss based on the difference between the first similarity and the second similarity corresponding to each sample pair and a preset second loss function;
and adjusting parameters of a feature extraction layer of the transitional student model based on each second loss to obtain a target student model.
In some embodiments of the present disclosure, adjusting parameters of a transitional student model based on a first similarity and a second similarity corresponding to each sample pair to obtain a target student model includes:
for each sample pair, determining a similarity difference value of a first similarity and a second similarity corresponding to the sample pair;
determining the sample pair as a target sample pair in response to the similarity difference of the sample pair being greater than a preset difference threshold;
training the transition student model by using the teacher model and at least one target sample pair to obtain a target student model.
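As a non-authoritative sketch of this pair-selection step, the following Python snippet filters sample pairs whose teacher-side and student-side similarities disagree by more than a threshold; the function name, array representation, and threshold value are illustrative assumptions, not part of the disclosure:

```python
import numpy as np

def select_target_pairs(first_sims, second_sims, threshold=0.2):
    """Return indices of sample pairs whose teacher/student similarity
    gap exceeds the (hypothetical) preset difference threshold."""
    gaps = np.abs(np.asarray(first_sims) - np.asarray(second_sims))
    return np.nonzero(gaps > threshold)[0]

# Pairs 0 and 2 show a large teacher-student disagreement:
idx = select_target_pairs([0.9, 0.5, 0.1], [0.4, 0.45, 0.8], threshold=0.2)
```

Only the retained pairs would then be used to further train the transitional student model, as described above.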
In some embodiments of the present disclosure, training a transitional student model using a teacher model and at least one target sample pair to obtain a target student model includes:
extracting features of at least one target sample pair by using a teacher model and a transition student model to obtain third features and fourth features of a second sample in each target sample pair;
determining first similarity of third features of the two second samples in each target sample pair and second similarity of fourth features of the two second samples in each target sample pair;
and adjusting parameters of the transition student model based on the first similarity and the second similarity corresponding to each target sample pair to obtain a target student model.
In some embodiments of the present disclosure, the third feature is extracted by a feature extraction layer in the teacher model, and the fourth feature is extracted by a feature extraction layer in the transitional student model;
adjusting parameters of the transitional student model based on the first similarity and the second similarity corresponding to each target sample pair to obtain a target student model, including: and adjusting parameters of a feature extraction layer of the transitional student model based on the first similarity and the second similarity corresponding to each target sample pair to obtain the target student model.
In some embodiments of the present disclosure, adjusting parameters of a feature extraction layer of a transitional student model based on a first similarity and a second similarity corresponding to each target sample pair to obtain a target student model includes:
determining a second loss based on the difference between the first similarity and the second similarity corresponding to each target sample pair and a preset second loss function;
and adjusting parameters of a feature extraction layer of the transitional student model based on each second loss to obtain a target student model.
In some embodiments of the present disclosure, the first feature is extracted by a feature extraction layer in the teacher model and the second feature is extracted by a feature extraction layer in the initial student model;
adjusting parameters of the initial student model based on differences in the first and second features of each first sample to obtain a transitional student model, comprising: and adjusting parameters of a feature extraction layer of the initial student model based on the difference between the first features and the second features of each first sample to obtain a transitional student model.
In some embodiments of the present disclosure, adjusting parameters of a feature extraction layer of an initial student model based on differences in the first feature and the second feature of each first sample to obtain a transitional student model includes:
determining a first loss based on the difference between the first feature and the second feature of each first sample and a preset first loss function;
and adjusting parameters of a feature extraction layer of the initial student model based on each first loss to obtain a transitional student model.
According to a second aspect of the present disclosure, there is provided a task processing method, including:
inputting data to be processed into a target student model, wherein the target student model is trained based on the model training method provided by the first aspect of the disclosure;
and outputting a corresponding processing result by using the target student model.
According to a third aspect of the present disclosure, there is provided a model training apparatus, including a first feature extraction module, a first parameter adjustment module, a second feature extraction module, a similarity comparison module, and a second parameter adjustment module;
the first feature extraction module is used for carrying out feature extraction on at least one first sample by utilizing a teacher model and an initial student model to obtain first features and second features of each first sample, wherein the first features are extracted by the teacher model, and the second features are extracted by the initial student model;
the first parameter adjustment module is used for adjusting parameters of the initial student model based on differences of the first characteristics and the second characteristics of each first sample to obtain a transitional student model;
the second feature extraction module is used for carrying out feature extraction on at least one sample pair by utilizing a teacher model and a transition student model to obtain third features and fourth features of second samples in each sample pair, wherein the sample pair comprises two second samples, the third features are extracted by the teacher model, and the fourth features are extracted by the transition student model;
The similarity comparison module is used for determining first similarity of third features of two second samples in each sample pair and second similarity of fourth features of the two second samples in each sample pair;
the second parameter adjustment module is used for adjusting parameters of the transition student model based on the first similarity and the second similarity corresponding to each sample pair to obtain the target student model.
In some embodiments of the present disclosure, the third feature is extracted by a feature extraction layer in the teacher model, and the fourth feature is extracted by a feature extraction layer in the transitional student model;
the second parameter adjustment module is used for adjusting parameters of the transition student model based on the first similarity and the second similarity corresponding to each sample pair to obtain a target student model, and is specifically used for: and adjusting parameters of a feature extraction layer of the transitional student model based on the first similarity and the second similarity corresponding to each sample pair to obtain a target student model.
In some embodiments of the present disclosure, the second parameter adjustment module is specifically configured to, when adjusting parameters of the feature extraction layer of the transitional student model based on the first similarity and the second similarity corresponding to each sample pair to obtain the target student model:
determining a second loss based on the difference between the first similarity and the second similarity corresponding to each sample pair and a preset second loss function;
and adjusting parameters of a feature extraction layer of the transitional student model based on each second loss to obtain a target student model.
In some embodiments of the present disclosure, the second parameter adjustment module is configured to, when adjusting parameters of the transitional student model based on the first similarity and the second similarity corresponding to each sample pair to obtain the target student model, specifically:
for each sample pair, determining a similarity difference value of a first similarity and a second similarity corresponding to the sample pair;
determining the sample pair as a target sample pair in response to the similarity difference of the sample pair being greater than a preset difference threshold;
training the transition student model by using the teacher model and at least one target sample pair to obtain a target student model.
In some embodiments of the present disclosure, the second parameter adjustment module, when configured to train the transitional student model using the teacher model and the at least one target sample pair, is specifically configured to:
extracting features of at least one target sample pair by using a teacher model and a transition student model to obtain third features and fourth features of a second sample in each target sample pair;
determining first similarity of third features of the two second samples in each target sample pair and second similarity of fourth features of the two second samples in each target sample pair;
and adjusting parameters of the transition student model based on the first similarity and the second similarity corresponding to each target sample pair to obtain a target student model.
In some embodiments of the present disclosure, the third feature is extracted by a feature extraction layer in the teacher model, and the fourth feature is extracted by a feature extraction layer in the transitional student model;
the second parameter adjustment module is used for adjusting parameters of the transition student model based on the first similarity and the second similarity corresponding to each target sample pair to obtain a target student model, and is specifically used for: and adjusting parameters of a feature extraction layer of the transitional student model based on the first similarity and the second similarity corresponding to each target sample pair to obtain the target student model.
In some embodiments of the present disclosure, the second parameter adjustment module is specifically configured to, when adjusting parameters of the feature extraction layer of the transitional student model based on the first similarity and the second similarity corresponding to each target sample pair to obtain the target student model:
determining a second loss based on the difference between the first similarity and the second similarity corresponding to each target sample pair and a preset second loss function;
and adjusting parameters of a feature extraction layer of the transitional student model based on each second loss to obtain a target student model.
In some embodiments of the present disclosure, the first feature is extracted by a feature extraction layer in the teacher model and the second feature is extracted by a feature extraction layer in the initial student model;
the first parameter adjustment module is used for adjusting parameters of the initial student model based on differences of the first characteristic and the second characteristic of each first sample to obtain a transitional student model, and is specifically used for: and adjusting parameters of a feature extraction layer of the initial student model based on the difference between the first features and the second features of each first sample to obtain a transitional student model.
In some embodiments of the present disclosure, the first parameter adjustment module, when configured to adjust parameters of a feature extraction layer of the initial student model based on differences between the first feature and the second feature of each first sample, is specifically configured to:
determining a first loss based on the difference between the first feature and the second feature of each first sample and a preset first loss function;
and adjusting parameters of a feature extraction layer of the initial student model based on each first loss to obtain a transitional student model.
According to a fourth aspect of the present disclosure, there is provided a task processing device including a data input module and a result output module;
the data input module is used for inputting data to be processed into a target student model, wherein the target student model is obtained by training based on the model training method provided by the first aspect of the disclosure;
the result output module is used for outputting corresponding processing results by using the target student model.
According to a fifth aspect of the present disclosure, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as provided in the first aspect or the method as provided in the second aspect.
According to a sixth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method provided in the first aspect or the method provided in the second aspect.
According to a seventh aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method provided by the first aspect or the method provided by the second aspect.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
The technical solution provided by the present disclosure brings the following beneficial effects:
according to the model training method provided by the embodiment of the disclosure, firstly, the student model is trained based on the difference of the characteristics extracted from the same sample by the teacher model and the student model, so that model convergence can be accelerated, and the model training speed is improved; and then, training the student model by using the difference of the similarity of the teacher model and the student model to the extracted characteristics of the same sample, so that the learning ability of the student model to the difference between different pictures is enhanced, and the accuracy of the student model is improved.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 illustrates a flow diagram of a model training method provided by the present disclosure;
FIG. 2 shows a specific flow diagram of S150 provided by the present disclosure;
FIG. 3 illustrates a schematic diagram of determining a first penalty in an image recognition scenario provided by the present disclosure;
FIG. 4 illustrates a schematic diagram of determining a second penalty in an image recognition scenario provided by the present disclosure;
FIG. 5 illustrates a flow diagram of a task processing method provided by the present disclosure;
FIG. 6 shows a schematic diagram of a model training apparatus provided by the present disclosure;
FIG. 7 illustrates a schematic diagram of a task processing device provided by the present disclosure;
fig. 8 shows a schematic block diagram of an example electronic device that may be used to implement the present disclosure.
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be appreciated that in embodiments of the present disclosure, the character "/" generally indicates that the associated objects are in an "or" relationship. The terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features.
Large-scale models cannot be deployed on equipment with limited computational resources, so the models need to be compressed in some way; in the related art, model compression is typically achieved through knowledge distillation, quantization, pruning, and other techniques.
Knowledge distillation is a common model training technique whose aim is to distill the knowledge of a teacher model with more parameters into a student model with fewer parameters; the student model is then the one deployed in actual use, reducing the computing resources required. However, when the network structures of the teacher model and the student model differ greatly, the training process may be inefficient or the accuracy of the resulting student model may be poor.
The model training method provided by the embodiments of the present disclosure may be performed by a terminal device, a computer, a server, or another device with data processing capabilities; the execution subject of the method is not limited herein. In some embodiments, the execution subject may be a terminal device such as a vehicle-mounted computer on a host vehicle.
Optionally, the terminal device may be a mobile phone, a tablet computer, a wearable device, a vehicle-mounted device, an augmented reality (AR)/virtual reality (VR) device, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), or the like; the specific type of the terminal device is not limited in the embodiments of the present disclosure.
In some embodiments, the server may be a single server, or may be a server cluster formed by a plurality of servers. In some implementations, the server cluster may also be a distributed cluster. The present disclosure is not limited to a specific implementation of the server.
The model training method provided in the present disclosure is exemplarily described below.
The present disclosure uses knowledge distillation to train models, and in particular, improves the performance of small models (e.g., student models) by distilling knowledge of larger models (e.g., teacher models) during the training process. Here, the specific types of the teacher model and the student model may depend on the actual use scenario, for example, in the image recognition scenario, the teacher model and the student model are both image recognition models; in the speech recognition scenario, both the teacher model and the student model are speech recognition models.
Fig. 1 shows a flow chart of a model training method provided by the present disclosure, and as shown in fig. 1, the method may mainly include the following steps:
s110: and extracting the characteristics of at least one first sample by using the teacher model and the initial student model to obtain a first characteristic and a second characteristic of each first sample.
It should be noted that a training set may be preconfigured in the present disclosure, where the training set includes a plurality of samples whose types depend on the actual usage scenario of the model: in an image recognition scenario the samples are pictures; in a speech recognition scenario they are speech segments. For ease of understanding and description, the samples referred to in S110 and S120 are defined as first samples.
Here, the first feature is extracted from the first sample by the teacher model, and the second feature is extracted from the first sample by the initial student model. Specifically, for each first sample, feature extraction is performed on it with the teacher model to obtain the first feature, and with the student model to obtain the second feature; that is, one first feature and one second feature are extracted from each first sample.
It will be appreciated that the model may include a feature extraction layer and a feature processing layer, where the specific type of feature processing layer may depend on the actual application scenario, and taking the image recognition scenario as an example, the feature processing layer in the model may be a classifier. In some embodiments of the present disclosure, the first feature is extracted by a feature extraction layer in the teacher model and the second feature is extracted by a feature extraction layer in the initial student model.
S120: and adjusting parameters of the initial student model based on the difference between the first characteristic and the second characteristic of each first sample to obtain a transitional student model.
In S120, a first loss may be determined according to the difference between the first feature and the second feature of each first sample and a preset first loss function, and parameters of the initial student model are adjusted based on each first loss, so as to obtain a transitional student model.
As described above, the first feature is extracted by the feature extraction layer in the teacher model, and the second feature is extracted by the feature extraction layer in the initial student model. Since only the learning of the feature extraction layer is involved here, only the parameters of the feature extraction layer need to be adjusted in S120, which makes reasonable use of computing resources and improves training efficiency. Optionally, in S120, the parameters of the feature extraction layer of the initial student model may be adjusted based on the differences between the first and second features of each first sample to obtain the transitional student model. Specifically, a first loss may be determined based on the difference between the first feature and the second feature of each first sample and a preset first loss function, and the parameters of the feature extraction layer of the initial student model adjusted based on each first loss to obtain the transitional student model. Here, the first loss function may be an absolute-value loss function (L1 loss), a mean squared error loss function (MSE loss), a Kullback-Leibler divergence loss function (KL loss), or the like.
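As an illustration only, the MSE variant of the first loss could be computed as in the sketch below; the feature vectors and function name are hypothetical stand-ins for the models' actual outputs:

```python
import numpy as np

def first_loss_mse(teacher_feats, student_feats):
    """MSE-style first loss: mean squared difference between the first
    features (teacher) and second features (student) over a batch."""
    t = np.asarray(teacher_feats, dtype=float)
    s = np.asarray(student_feats, dtype=float)
    return float(np.mean((t - s) ** 2))

# Two first samples, each with a 2-dimensional feature from each model:
loss = first_loss_mse([[1.0, 0.0], [0.0, 1.0]], [[0.8, 0.1], [0.2, 0.9]])
```

This scalar loss would then drive the adjustment of the student model's feature extraction layer, for example by gradient descent.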
S130: and extracting the characteristics of at least one sample pair by using the teacher model and the transition student model to obtain the third characteristics and the fourth characteristics of the second sample in each sample pair.
It should be noted that a training set may be preconfigured in the present disclosure, where the training set includes a plurality of samples whose types depend on the actual usage scenario of the model: in an image recognition scenario the samples are pictures; in a speech recognition scenario they are speech segments. For ease of understanding and description, the samples referred to in S130 and S140 are defined as second samples. The training set used in S130 may be the same as, or different from, the training set used in S110; in either case the two training sets have the same sample type, for example, both consist of pictures.
Here, a sample pair includes two second samples, which may be any two samples in the training set. The third feature is extracted from the second sample by the teacher model, and the fourth feature is extracted from the second sample by the transitional student model.
Specifically, for each sample pair, the teacher model extracts features from each second sample in the pair to obtain the third feature of each second sample; the transitional student model likewise extracts features from each second sample in the pair to obtain the fourth feature of each second sample. That is, a third feature and a fourth feature are extracted from each second sample.
It will be appreciated that the model may include a feature extraction layer and a feature processing layer, where the specific type of the feature processing layer may depend on the actual application scenario; taking the image recognition scenario as an example, the feature processing layer in the model may be a classifier. In some embodiments of the present disclosure, the third feature is extracted by a feature extraction layer in the teacher model, and the fourth feature is extracted by a feature extraction layer in the transitional student model.
S140: first similarities of third features of the two second samples in each sample pair and second similarities of fourth features of the two second samples in each sample pair are determined.
It will be appreciated that each second sample in each sample pair is extracted by the teacher model to a third feature and by the transitional student model to a fourth feature. For each sample pair, the similarity of the third features of the two second samples in the sample pair and the similarity of the fourth features of the two second samples in the sample pair can be determined by a preset feature comparison method.
For ease of understanding and description, the similarity of the third features of the two second samples in the sample pair is defined as a first similarity, and the similarity of the fourth features of the two second samples in the sample pair is defined as a second similarity, wherein the first similarity and the second similarity are of the same type, and both may be cosine similarity, distance-based similarity, or the like.
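Taking cosine similarity as an example, the feature comparison in S140 can be sketched in plain Python (an illustrative helper only; features are assumed to be lists of floats, and framework routines such as torch.nn.functional.cosine_similarity would be used in practice):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors, used for both the
    first similarity (third features) and the second similarity (fourth
    features) of a sample pair."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

The same function is applied once to the teacher-side feature pair and once to the student-side feature pair, so the two similarities are directly comparable.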
S150: and adjusting parameters of the transition student model based on the first similarity and the second similarity corresponding to each sample pair to obtain a target student model.
It will be appreciated that in the training of the transitional student model, it is desirable that the first similarity and the second similarity corresponding to each sample pair be as close as possible, so in S150, parameters of the transitional student model may be adjusted based on the difference between the first similarity and the second similarity corresponding to each sample pair, and finally, the target student model is obtained.
According to the model training method provided by the embodiment of the disclosure, the student model is first trained based on the difference between the features extracted from the same sample by the teacher model and the student model, which accelerates model convergence and improves training speed; the student model is then trained using the difference between the teacher-side and student-side similarities of the features extracted from the two samples in each sample pair, which strengthens the student model's ability to learn the differences between different samples and improves the accuracy of the student model.
As described above, the third feature is extracted by the feature extraction layer in the teacher model, and the fourth feature is extracted by the feature extraction layer in the transitional student model. Since only the learning process of the feature extraction layer is involved here, only the parameters of the feature extraction layer need to be adjusted in S150, so that computing resources are used efficiently and the training efficiency of the model is improved. Optionally, in S150, parameters of the feature extraction layer of the transitional student model may be adjusted based on the first similarity and the second similarity corresponding to each sample pair, to obtain the target student model. Specifically, a second loss may be determined based on the difference between the first similarity and the second similarity corresponding to each sample pair and a preset second loss function; parameters of the feature extraction layer of the transitional student model are then adjusted based on each second loss to obtain the target student model. Here, the second loss function may be an absolute value loss function (L1 loss), a mean squared error loss function (MSE loss), a Kullback-Leibler divergence loss function (KL loss), or the like.
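Assuming an L1-style second loss function, the batch-level second loss over the sample pairs can be sketched as follows (a minimal plain-Python illustration; the tuple format is a hypothetical convention for this sketch):

```python
def second_loss(pair_similarities):
    """Second loss over a batch of sample pairs.

    pair_similarities: list of (first_similarity, second_similarity)
    tuples, i.e. the teacher-side and student-side similarity for each
    sample pair. Returns the mean absolute difference."""
    diffs = [abs(s1 - s2) for s1, s2 in pair_similarities]
    return sum(diffs) / len(diffs)
```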
Optionally, in S150, sample pairs may be screened based on the comparison of the first similarity and the second similarity corresponding to each sample pair, and then the transitional student model may be trained based on the screened sample pairs. Fig. 2 shows a specific flow diagram of S150 provided in the present disclosure, and as shown in fig. 2, S150 may be mainly divided into the following steps:
S210: for each sample pair, a similarity difference between a first similarity and a second similarity corresponding to the sample pair is determined.
For each sample pair, the present disclosure may compute the difference between the corresponding first similarity and second similarity to obtain a similarity difference. It can be understood that the similarity difference is a quantized representation of the gap between the first similarity and the second similarity; during training of the transitional student model, it is desirable that the similarity difference corresponding to each sample pair be as small as possible.
S220: and determining the sample pair as a target sample pair in response to the similarity difference of the sample pair being greater than a preset difference threshold.
The disclosure may preset a difference threshold, and the specific value of the difference threshold may be determined according to the actual design requirement, and the difference threshold may be used to evaluate the degree of difference between the first similarity and the second similarity.
Specifically, if the similarity difference between the first similarity and the second similarity corresponding to a sample pair is greater than the preset difference threshold, the gap between the first similarity and the second similarity is large, indicating that the fourth features extracted by the transitional student model for the two second samples in the pair have low accuracy; if the similarity difference is less than or equal to the preset difference threshold, the gap between the first similarity and the second similarity is small, indicating that the fourth features extracted by the transitional student model for the two second samples in the pair have high accuracy.
For a sample pair whose similarity difference is greater than the difference threshold, the fourth features extracted by the transitional student model for its two second samples have low accuracy, so the pair can be determined as a target sample pair; all target sample pairs are constructed into a new training set, with which the transitional student model can continue to be trained in subsequent steps. By screening out the target sample pairs with poor feature extraction results based on the similarity difference between the first similarity and the second similarity, and then training the transitional student model only on these target sample pairs, the number of sample pairs used for training is reduced, computing resources are saved, training efficiency is improved, and model convergence is accelerated.
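The screening in S210–S220 can be sketched in plain Python (an illustrative helper; the pair-id tuple format is a hypothetical convention for this sketch):

```python
def select_target_pairs(pairs, diff_threshold):
    """Keep only sample pairs whose similarity difference exceeds the
    preset difference threshold.

    pairs: list of (pair_id, first_similarity, second_similarity).
    Returns the ids of the target sample pairs that form the new
    training set used to continue training the transitional model."""
    return [pid for pid, s1, s2 in pairs if abs(s1 - s2) > diff_threshold]
```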
S230: training the transition student model by using the teacher model and at least one target sample pair to obtain a target student model.
As previously indicated, all target sample pairs are constructed into a new training set. In S230, at least one target sample pair may be selected in turn from the new training set to train the transitional student model, and after at least one round of training the transitional student model becomes the target student model.
Optionally, in S230, feature extraction may be performed on at least one target sample pair using a teacher model and a transitional student model, resulting in third and fourth features of the second sample in each target sample pair.
It will be appreciated that the target sample pair includes two second samples, which may be any two samples in the training set. The third feature is extracted from the second sample by the teacher model, and the fourth feature is extracted from the second sample by the transitional student model.
Specifically, for each target sample pair, the teacher model extracts features from each second sample in the target sample pair to obtain the third feature of each second sample; the transitional student model likewise extracts features from each second sample in the target sample pair to obtain the fourth feature of each second sample. That is, a third feature and a fourth feature are extracted from each second sample.
It will be appreciated that the model may include a feature extraction layer and a feature processing layer, where the specific type of feature processing layer may depend on the actual application scenario, and taking the image recognition scenario as an example, the feature processing layer in the model may be a classifier. In some embodiments of the present disclosure, the third feature is extracted by a feature extraction layer in the teacher model and the fourth feature is extracted by a feature extraction layer in the transitional student model.
In S230, after obtaining the third and fourth features of the second samples in each target sample pair, the first similarity of the third features of the two second samples in each target sample pair, the second similarity of the fourth features of the two second samples in each target sample pair may be determined.
It will be appreciated that each second sample in each target sample pair is extracted by the teacher model to a third feature and by the transitional student model to a fourth feature. For each target sample pair, the similarity of the third features of the two second samples in the target sample pair and the similarity of the fourth features of the two second samples in the target sample pair can be determined through a preset feature comparison method.
For ease of understanding and description, the similarity of the third features of the two second samples in the target sample pair is defined as a first similarity, and the similarity of the fourth features of the two second samples in the target sample pair is defined as a second similarity, wherein the types of the first similarity and the second similarity are the same, and the two may be cosine similarity or distance similarity, etc.
After determining the first similarity and the second similarity corresponding to each target sample pair, parameters of the transitional student model can be adjusted based on the first similarity and the second similarity corresponding to each target sample pair, to obtain the target student model. Optionally, parameters of the feature extraction layer of the transitional student model may be adjusted based on the first similarity and the second similarity corresponding to each target sample pair, so as to obtain the target student model. Specifically, a second loss may be determined based on the difference between the first similarity and the second similarity corresponding to each target sample pair and a preset second loss function; parameters of the feature extraction layer of the transitional student model are then adjusted based on each second loss to obtain the target student model. Here, the second loss function may be an absolute value loss function (L1 loss), a mean squared error loss function (MSE loss), a Kullback-Leibler divergence loss function (KL loss), or the like.
The training process for performing a student model of an image recognition scene will be described below with reference to fig. 3 and 4 by taking the image recognition scene as an example.
Fig. 3 is a schematic diagram illustrating determining a first loss in the image recognition scenario provided by the present disclosure, in fig. 3, a picture a is the first sample, a feature extraction layer of a teacher model is used to extract a first feature of the picture a, a feature extraction layer of an initial student model is used to extract a second feature of the picture a, the first loss is determined based on the first feature and the second feature of the picture a, and then parameters of the feature extraction layer of the initial student model are adjusted based on the first loss, so as to obtain a transitional student model.
Fig. 4 is a schematic diagram showing determining the second loss in the image recognition scenario provided in the present disclosure. In fig. 4, a picture a and a picture b are taken as a sample pair; the feature extraction layer of the teacher model extracts the third feature of picture a and the third feature of picture b, and the feature extraction layer of the transitional student model extracts the fourth feature of picture a and the fourth feature of picture b. Then the first similarity between the third feature of picture a and the third feature of picture b is determined, and the second similarity between the fourth feature of picture a and the fourth feature of picture b is determined; the second loss is determined based on the first similarity and the second similarity, and parameters of the feature extraction layer of the transitional student model are adjusted based on the second loss to obtain the target student model.
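The fig. 4 computation for a single picture pair can be sketched end to end in plain Python (an illustrative sketch assuming cosine similarity and an L1 second loss; features are plain lists of floats rather than framework tensors):

```python
import math

def _cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def pair_second_loss(teacher_feat_a, teacher_feat_b, student_feat_a, student_feat_b):
    """Second loss for one picture pair (a, b): the absolute difference
    between the teacher-side similarity (first similarity) and the
    student-side similarity (second similarity)."""
    first_similarity = _cosine(teacher_feat_a, teacher_feat_b)
    second_similarity = _cosine(student_feat_a, student_feat_b)
    return abs(first_similarity - second_similarity)
```

When the transitional student model's similarity structure matches the teacher's, this loss is zero; a large value flags a pair on which the student still disagrees with the teacher.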
Fig. 5 shows a flow chart of a task processing method provided by the present disclosure, and as shown in fig. 5, the method may mainly include the following steps:
S510: and inputting the data to be processed into the target student model.
It should be noted that the target student model is trained based on the model training method disclosed above. The specific type of the target student model can be determined according to the actual use scene, for example, in an image recognition scene, the target student model is an image recognition model, and the data to be processed is a picture; in the voice recognition scene, the target student model is a voice recognition model, and the data to be processed is a voice fragment.
S520: and outputting a corresponding processing result by using the target student model.
In S520, after the target student model performs steps such as feature extraction and feature processing on the data to be processed, a corresponding processing result may be output. Taking an image recognition scenario as an example, the target student model may output a picture recognition result; taking a speech recognition scenario as an example, the target student model may output a speech recognition result.
Based on the same principle as the model training method described above, an embodiment of the present disclosure provides a model training apparatus, and fig. 6 shows a schematic diagram of the model training apparatus provided by the present disclosure, as shown in fig. 6, the model training apparatus 600 includes a first feature extraction module 610, a first parameter adjustment module 620, a second feature extraction module 630, a similarity comparison module 640, and a second parameter adjustment module 650.
The first feature extraction module 610 is configured to perform feature extraction on at least one first sample by using a teacher model and an initial student model, so as to obtain a first feature and a second feature of each first sample, where the first feature is extracted by the teacher model and the second feature is extracted by the initial student model.
The first parameter adjustment module 620 is configured to adjust parameters of the initial student model based on differences between the first feature and the second feature of each first sample, resulting in a transitional student model.
The second feature extraction module 630 is configured to perform feature extraction on at least one sample pair by using a teacher model and a transitional student model, so as to obtain a third feature and a fourth feature of the second samples in each sample pair, where the sample pair includes two second samples, the third feature is extracted by the teacher model, and the fourth feature is extracted by the transitional student model.
The similarity comparison module 640 is configured to determine a first similarity of the third features of the two second samples in each sample pair, and a second similarity of the fourth features of the two second samples in each sample pair.
The second parameter adjustment module 650 is configured to adjust parameters of the transitional student model based on the first similarity and the second similarity corresponding to each sample pair, so as to obtain a target student model.
According to the model training apparatus provided by the embodiment of the disclosure, the student model is first trained based on the difference between the features extracted from the same sample by the teacher model and the student model, which accelerates model convergence and improves training speed; the student model is then trained using the difference between the teacher-side and student-side similarities of the features extracted from the two samples in each sample pair, which strengthens the student model's ability to learn the differences between different samples and improves the accuracy of the student model.
In some embodiments of the present disclosure, the third feature is extracted by a feature extraction layer in the teacher model, and the fourth feature is extracted by a feature extraction layer in the transitional student model;
the second parameter adjustment module 650 is specifically configured to, when adjusting parameters of the transitional student model based on the first similarity and the second similarity corresponding to each sample pair to obtain the target student model: and adjusting parameters of a feature extraction layer of the transitional student model based on the first similarity and the second similarity corresponding to each sample pair to obtain a target student model.
In some embodiments of the present disclosure, the second parameter adjustment module 650 is specifically configured to, when adjusting parameters of the feature extraction layer of the transitional student model based on the first similarity and the second similarity corresponding to each sample pair, obtain the target student model:
Determining a second loss based on the difference between the first similarity and the second similarity corresponding to each sample pair and a preset second loss function;
and adjusting parameters of a feature extraction layer of the transitional student model based on each second loss to obtain a target student model.
In some embodiments of the present disclosure, the second parameter adjustment module 650 is specifically configured to, when adjusting parameters of the transitional student model based on the first similarity and the second similarity corresponding to each sample pair to obtain the target student model:
for each sample pair, determining a similarity difference value of a first similarity and a second similarity corresponding to the sample pair;
determining the sample pair as a target sample pair in response to the similarity difference of the sample pair being greater than a preset difference threshold;
training the transition student model by using the teacher model and at least one target sample pair to obtain a target student model.
In some embodiments of the present disclosure, the second parameter adjustment module 650, when configured to train the transitional student model using the teacher model and the at least one target sample pair, is specifically configured to:
extracting features of at least one target sample pair by using a teacher model and a transition student model to obtain third features and fourth features of a second sample in each target sample pair;
Determining first similarity of third features of the two second samples in each target sample pair and second similarity of fourth features of the two second samples in each target sample pair;
and adjusting parameters of the transition student model based on the first similarity and the second similarity corresponding to each target sample pair to obtain a target student model.
In some embodiments of the present disclosure, the third feature is extracted by a feature extraction layer in the teacher model, and the fourth feature is extracted by a feature extraction layer in the transitional student model;
the second parameter adjustment module 650 is specifically configured to, when adjusting parameters of the transition student model based on the first similarity and the second similarity corresponding to each target sample pair to obtain a target student model: and adjusting parameters of a feature extraction layer of the transitional student model based on the first similarity and the second similarity corresponding to each target sample pair to obtain the target student model.
In some embodiments of the present disclosure, the second parameter adjustment module 650 is specifically configured to, when adjusting parameters of the feature extraction layer of the transitional student model based on the first similarity and the second similarity corresponding to each target sample pair, obtain the target student model:
Determining a second loss based on the difference between the first similarity and the second similarity corresponding to each target sample pair and a preset second loss function;
and adjusting parameters of a feature extraction layer of the transitional student model based on each second loss to obtain a target student model.
In some embodiments of the present disclosure, the first feature is extracted by a feature extraction layer in the teacher model and the second feature is extracted by a feature extraction layer in the initial student model;
the first parameter adjustment module 620 is specifically configured to, when adjusting parameters of the initial student model based on differences between the first feature and the second feature of each first sample to obtain a transitional student model: and adjusting parameters of a feature extraction layer of the initial student model based on the difference between the first features and the second features of each first sample to obtain a transitional student model.
In some embodiments of the present disclosure, the first parameter adjustment module 620 is specifically configured to, when adjusting parameters of the feature extraction layer of the initial student model based on differences between the first feature and the second feature of each first sample, obtain a transitional student model:
determining a first loss based on the difference between the first feature and the second feature of each first sample and a preset first loss function;
And adjusting parameters of a feature extraction layer of the initial student model based on each first loss to obtain a transitional student model.
It can be understood that the above modules of the model training apparatus in the embodiments of the present disclosure have functions of implementing the corresponding steps of the above model training method. The functions can be realized by hardware, and can also be realized by executing corresponding software by hardware. The hardware or software includes one or more modules corresponding to the functions described above. The modules may be software and/or hardware, and each module may be implemented separately or may be implemented by integrating multiple modules. For the functional description of each module of the model training apparatus, reference may be made to the corresponding description of the model training method, which is not repeated herein.
Based on the same principle as the task processing method described above, an embodiment of the present disclosure provides a task processing device. Fig. 7 shows a schematic diagram of the task processing device provided by the present disclosure; as shown in fig. 7, the task processing device 700 includes a data input module 710 and a result output module 720.
The data input module 710 is configured to input data to be processed into a target student model, where the target student model is trained based on the model training method provided in the present disclosure.
The result output module 720 is configured to output a corresponding processing result by using the target student model.
It will be appreciated that the above modules of the task processing device in the embodiments of the present disclosure have functions of implementing the corresponding steps of the task processing method described above. The functions can be realized by hardware, and can also be realized by executing corresponding software by hardware. The hardware or software includes one or more modules corresponding to the functions described above. The modules may be software and/or hardware, and each module may be implemented separately or may be implemented by integrating multiple modules. For the functional description of each module of the task processing device, reference may be specifically made to the corresponding description of the task processing method, which is not repeated herein.
In the technical scheme of the disclosure, the acquisition, storage, application and the like of the related user personal information all conform to the regulations of related laws and regulations, and the public sequence is not violated.
According to an embodiment of the disclosure, the disclosure further provides an electronic device, a readable storage medium, and a computer program product.
In an exemplary embodiment, an electronic device includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described in the above embodiments. The electronic device may be the computer or server described above.
In an exemplary embodiment, the readable storage medium may be a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method according to the above embodiment.
In an exemplary embodiment, the computer program product comprises a computer program which, when executed by a processor, implements the method according to the above embodiments.
Fig. 8 shows a schematic block diagram of an example electronic device that may be used to implement the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the electronic device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in electronic device 800 are connected to I/O interface 805, including: an input unit 806 such as a keyboard, mouse, etc.; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, etc.; and a communication unit 809, such as a network card, modem, wireless communication transceiver, or the like. The communication unit 809 allows the electronic device 800 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running artificial intelligence model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the respective methods and processes described above, such as a model training method or a task processing method. For example, in some embodiments, the model training method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the model training method or the task processing method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform a model training method or a task processing method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above can be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs, where the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that the various flows shown above may be used, with steps reordered, added, or deleted. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed aspects are achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (20)

1. A method of model training, the method comprising:
performing feature extraction on at least one first sample by using a teacher model and an initial student model to obtain first features and second features of each first sample, wherein the first features are extracted by the teacher model, and the second features are extracted by the initial student model;
adjusting parameters of the initial student model based on differences of the first features and the second features of each first sample to obtain a transitional student model;
performing feature extraction on at least one sample pair by using the teacher model and the transition student model to obtain a third feature and a fourth feature of a second sample in each sample pair, wherein the sample pair comprises two second samples, the third feature is extracted by the teacher model, and the fourth feature is extracted by the transition student model;
determining first similarity of third features of two second samples in each sample pair and second similarity of fourth features of two second samples in each sample pair;
and adjusting parameters of the transition student model based on the first similarity and the second similarity corresponding to each sample pair to obtain a target student model.
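Outside the scope of the claims, the two-stage procedure of claim 1 can be sketched in code. This is a minimal illustrative sketch only: the claim does not fix a similarity measure or loss function, so the cosine similarity, plain-list features, and mean-squared losses below are assumptions.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two feature vectors given as plain lists.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def stage1_feature_loss(first_feats, second_feats):
    # Stage 1: mean squared gap between the first features (teacher) and
    # the second features (initial student) of each first sample.
    total, count = 0.0, 0
    for t, s in zip(first_feats, second_feats):
        total += sum((ti - si) ** 2 for ti, si in zip(t, s))
        count += len(t)
    return total / count

def stage2_similarity_loss(teacher_pairs, student_pairs):
    # Stage 2: for each sample pair, compare the first similarity (between
    # the teacher's third features) with the second similarity (between the
    # transition student's fourth features), then average the squared gaps.
    gaps = []
    for (t1, t2), (s1, s2) in zip(teacher_pairs, student_pairs):
        first_sim = cosine_similarity(t1, t2)
        second_sim = cosine_similarity(s1, s2)
        gaps.append((first_sim - second_sim) ** 2)
    return sum(gaps) / len(gaps)
```

Minimizing the stage-1 loss yields the transition student model; minimizing the stage-2 loss over sample pairs yields the target student model.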
2. The method of claim 1, wherein the third feature is extracted by a feature extraction layer in the teacher model and the fourth feature is extracted by a feature extraction layer in the transition student model;
the step of adjusting parameters of the transition student model based on the first similarity and the second similarity corresponding to each sample pair to obtain a target student model comprises the following steps: and adjusting parameters of a feature extraction layer of the transition student model based on the first similarity and the second similarity corresponding to each sample pair to obtain a target student model.
3. The method of claim 2, wherein the adjusting parameters of the feature extraction layer of the transitional student model based on the first similarity and the second similarity corresponding to each of the pairs of samples to obtain a target student model comprises:
determining a second loss based on the difference between the first similarity and the second similarity corresponding to each sample pair and a preset second loss function;
and adjusting parameters of a feature extraction layer of the transition student model based on each second loss to obtain a target student model.
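Outside the claim language, the "second loss" of claim 3 can be sketched as follows; the squared-error loss function is an assumed choice, since the claim only requires some preset loss function over the similarity gap.

```python
def second_loss(similarity_pairs, loss_fn=lambda gap: gap * gap):
    # Claim 3: aggregate a preset loss function over the gap between the
    # first similarity (teacher) and the second similarity (transition
    # student) of each sample pair. Squared error is an illustrative
    # assumption; the claim leaves the function unspecified.
    return sum(loss_fn(first - second) for first, second in similarity_pairs)
```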
4. The method of claim 1, wherein the adjusting parameters of the transitional student model based on the first similarity and the second similarity corresponding to each of the pairs of samples to obtain a target student model comprises:
for each sample pair, determining a similarity difference value of a first similarity and a second similarity corresponding to the sample pair;
determining the sample pair as a target sample pair in response to the similarity difference of the sample pair being greater than a preset difference threshold;
and training the transition student model by using the teacher model and at least one target sample pair to obtain a target student model.
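The target-pair selection of claim 4 amounts to mining the pairs on which teacher and student still disagree. A minimal sketch, assuming a hypothetical threshold of 0.1 (the claim does not fix a value):

```python
def select_target_pairs(pairs, difference_threshold=0.1):
    # Claim 4: a sample pair becomes a target sample pair when the gap
    # between its first similarity (teacher) and second similarity
    # (student) exceeds the preset difference threshold.
    return [
        pair_id
        for pair_id, first_sim, second_sim in pairs
        if abs(first_sim - second_sim) > difference_threshold
    ]

# (pair_id, first_similarity, second_similarity) records
pairs = [("easy", 0.90, 0.88), ("hard1", 0.70, 0.40), ("hard2", 0.20, 0.45)]
# "hard1" and "hard2" disagree by more than the threshold, so only they
# are kept for the second round of training.
```

Training then continues on the selected pairs only, which concentrates the second distillation stage on the cases the transition student has not yet matched.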
5. The method of claim 4, wherein the training the transition student model by using the teacher model and at least one target sample pair to obtain a target student model comprises:
extracting features of at least one target sample pair by using the teacher model and the transition student model to obtain third features and fourth features of a second sample in each target sample pair;
determining first similarity of third features of two second samples in each target sample pair and second similarity of fourth features of two second samples in each target sample pair;
and adjusting parameters of the transition student model based on the first similarity and the second similarity corresponding to each target sample pair to obtain a target student model.
6. The method of claim 5, wherein the third feature is extracted by a feature extraction layer in the teacher model and the fourth feature is extracted by a feature extraction layer in the transition student model;
the step of adjusting parameters of the transition student model based on the first similarity and the second similarity corresponding to each target sample pair to obtain a target student model comprises the following steps: and adjusting parameters of a feature extraction layer of the transition student model based on the first similarity and the second similarity corresponding to each target sample pair to obtain a target student model.
7. The method of claim 6, wherein the adjusting parameters of the feature extraction layer of the transitional student model based on the first similarity and the second similarity corresponding to each of the target sample pairs to obtain a target student model comprises:
determining a second loss based on the difference between the first similarity and the second similarity corresponding to each target sample pair and a preset second loss function;
and adjusting parameters of a feature extraction layer of the transition student model based on each second loss to obtain a target student model.
8. The method of claim 1, wherein the first feature is extracted by a feature extraction layer in the teacher model and the second feature is extracted by a feature extraction layer in the initial student model;
the step of adjusting parameters of the initial student model based on differences between the first features and the second features of each first sample to obtain a transitional student model comprises the following steps: and adjusting parameters of a feature extraction layer of the initial student model based on differences of the first features and the second features of each first sample to obtain a transitional student model.
9. The method of claim 8, wherein the adjusting parameters of the feature extraction layer of the initial student model based on differences in the first and second features of each of the first samples results in a transitional student model, comprising:
determining a first loss based on the difference between the first feature and the second feature of each first sample and a preset first loss function;
and adjusting parameters of a feature extraction layer of the initial student model based on each first loss to obtain a transitional student model.
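The parameter adjustment of claims 8-9 can be illustrated with a toy one-parameter "feature extraction layer" trained by gradient descent on the first loss. The linear layer, learning rate, and step count are illustrative assumptions, not anything the claims specify.

```python
def fit_transition_student(inputs, first_feats, w=0.0, lr=0.1, steps=100):
    # Toy one-parameter feature extraction layer: feature = w * x.
    # Gradient descent on the first loss (mean squared gap between the
    # teacher's first features and the student's second features) adjusts
    # the layer's parameter, yielding the transition student model.
    n = len(inputs)
    for _ in range(steps):
        grad = sum(2.0 * (w * x - t) * x for x, t in zip(inputs, first_feats)) / n
        w -= lr * grad
    return w
```

If the teacher's features are generated by weight 2.0, the student's layer converges to that weight, i.e. the student reproduces the teacher's feature space.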
10. A method of task processing, the method comprising:
inputting data to be processed into a target student model, wherein the target student model is trained based on the model training method according to any one of claims 1-9;
and outputting a corresponding processing result by using the target student model.
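The task-processing method of claim 10 is plain inference with the distilled model. A minimal sketch; the lambda standing in for the trained target student model is hypothetical.

```python
def process_task(data_to_process, target_student_model):
    # Claim 10: feed each item of the data to be processed into the
    # trained target student model and output the processing result.
    return [target_student_model(x) for x in data_to_process]

# Hypothetical distilled model: the toy linear extractor with weight 2.0.
distilled = lambda x: 2.0 * x
results = process_task([1.0, 3.5], distilled)
```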
11. A model training apparatus, the apparatus comprising:
the first feature extraction module is used for carrying out feature extraction on at least one first sample by utilizing a teacher model and an initial student model to obtain first features and second features of each first sample, wherein the first features are extracted by the teacher model, and the second features are extracted by the initial student model;
the first parameter adjustment module is used for adjusting parameters of the initial student model based on differences of first features and second features of each first sample to obtain a transition student model;
a second feature extraction module, used for performing feature extraction on at least one sample pair by using the teacher model and the transition student model to obtain a third feature and a fourth feature of a second sample in each of the sample pairs, wherein the sample pair comprises two second samples, the third feature is extracted by the teacher model, and the fourth feature is extracted by the transition student model;
a similarity comparison module for determining a first similarity of a third feature of two second samples in each of the sample pairs, a second similarity of a fourth feature of two second samples in each of the sample pairs;
and the second parameter adjustment module is used for adjusting parameters of the transition student model based on the first similarity and the second similarity corresponding to each sample pair to obtain a target student model.
12. The apparatus of claim 11, wherein the third feature is extracted by a feature extraction layer in the teacher model and the fourth feature is extracted by a feature extraction layer in the transition student model;
the second parameter adjustment module, when adjusting parameters of the transition student model based on the first similarity and the second similarity corresponding to each sample pair to obtain a target student model, is specifically configured to: adjust parameters of a feature extraction layer of the transition student model based on the first similarity and the second similarity corresponding to each sample pair to obtain a target student model.
13. The apparatus of claim 12, wherein the second parameter adjustment module, when adjusting parameters of the feature extraction layer of the transition student model based on the first similarity and the second similarity corresponding to each of the sample pairs to obtain a target student model, is specifically configured to:
determining a second loss based on the difference between the first similarity and the second similarity corresponding to each sample pair and a preset second loss function;
and adjusting parameters of a feature extraction layer of the transition student model based on each second loss to obtain a target student model.
14. The apparatus of claim 11, wherein the second parameter adjustment module, when configured to adjust parameters of the transition student model based on the first similarity and the second similarity corresponding to each of the pairs of samples, is specifically configured to:
for each sample pair, determining a similarity difference value of a first similarity and a second similarity corresponding to the sample pair;
determining the sample pair as a target sample pair in response to the similarity difference of the sample pair being greater than a preset difference threshold;
and training the transition student model by using the teacher model and at least one target sample pair to obtain a target student model.
15. The apparatus of claim 14, wherein the second parameter adjustment module, when configured to train the transitional student model using the teacher model and at least one of the target sample pairs, is specifically configured to:
extracting features of at least one target sample pair by using the teacher model and the transition student model to obtain third features and fourth features of a second sample in each target sample pair;
determining first similarity of third features of two second samples in each target sample pair and second similarity of fourth features of two second samples in each target sample pair;
and adjusting parameters of the transition student model based on the first similarity and the second similarity corresponding to each target sample pair to obtain a target student model.
16. The apparatus of claim 15, wherein the third feature is extracted by a feature extraction layer in the teacher model and the fourth feature is extracted by a feature extraction layer in the transition student model;
the second parameter adjustment module, when adjusting parameters of the transition student model based on the first similarity and the second similarity corresponding to each of the target sample pairs to obtain a target student model, is specifically configured to: adjust parameters of a feature extraction layer of the transition student model based on the first similarity and the second similarity corresponding to each target sample pair to obtain a target student model.
17. A task processing device, the device comprising:
a data input module for inputting data to be processed into a target student model, wherein the target student model is trained based on the model training method of any one of claims 1 to 9;
and the result output module is used for outputting a corresponding processing result by utilizing the target student model.
18. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9 or to perform the method of claim 10.
19. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-9 or to perform the method of claim 10.
20. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-9 or implements the method according to claim 10.
CN202310370283.8A 2023-04-07 2023-04-07 Model training method, task processing method, device, electronic equipment and medium Pending CN116361658A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310370283.8A CN116361658A (en) 2023-04-07 2023-04-07 Model training method, task processing method, device, electronic equipment and medium


Publications (1)

Publication Number Publication Date
CN116361658A (en) 2023-06-30

Family

ID=86938311

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310370283.8A Pending CN116361658A (en) 2023-04-07 2023-04-07 Model training method, task processing method, device, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN116361658A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112529181A (en) * 2020-12-15 2021-03-19 北京百度网讯科技有限公司 Method and apparatus for model distillation
CN113239176A (en) * 2021-06-21 2021-08-10 中国平安人寿保险股份有限公司 Semantic matching model training method, device, equipment and storage medium
CN113361710A (en) * 2021-06-29 2021-09-07 北京百度网讯科技有限公司 Student model training method, picture processing device and electronic equipment
WO2022051855A1 (en) * 2020-09-09 2022-03-17 Huawei Technologies Co., Ltd. Method and system for training a neural network model using gradual knowledge distillation
CN114328834A (en) * 2021-12-29 2022-04-12 成都晓多科技有限公司 Model distillation method and system and text retrieval method
CN115063875A (en) * 2022-08-16 2022-09-16 北京百度网讯科技有限公司 Model training method, image processing method, device and electronic equipment
WO2022227400A1 (en) * 2021-04-27 2022-11-03 商汤集团有限公司 Neural network training method and apparatus, device, and computer storage medium
CN115526332A (en) * 2022-08-17 2022-12-27 阿里巴巴(中国)有限公司 Student model training method and text classification system based on pre-training language model
WO2022267717A1 (en) * 2021-06-23 2022-12-29 北京字跳网络技术有限公司 Model training method and apparatus, and readable storage medium


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZUO HAOYU: "Research on Knowledge Distillation Algorithms Based on Metric Learning", China Master's Theses Full-text Database, Information Science and Technology, no. 02, pages 138-718 *
GE SHIMING et al.: "Face Recognition Based on Deep Feature Distillation", Journal of Beijing Jiaotong University, vol. 41, no. 06, pages 27-33 *

Similar Documents

Publication Publication Date Title
CN113379627B (en) Training method of image enhancement model and method for enhancing image
CN113343803A (en) Model training method, device, equipment and storage medium
CN113627536B (en) Model training, video classification method, device, equipment and storage medium
CN115690443B (en) Feature extraction model training method, image classification method and related devices
CN114449343A (en) Video processing method, device, equipment and storage medium
CN113378855A (en) Method for processing multitask, related device and computer program product
CN113627361B (en) Training method and device for face recognition model and computer program product
CN113033408B (en) Data queue dynamic updating method and device, electronic equipment and storage medium
CN113792876A (en) Backbone network generation method, device, equipment and storage medium
CN114724144B (en) Text recognition method, training device, training equipment and training medium for model
CN113177483B (en) Video object segmentation method, device, equipment and storage medium
CN114707638A (en) Model training method, model training device, object recognition method, object recognition device, object recognition medium and product
CN114842541A (en) Model training and face recognition method, device, equipment and storage medium
CN115019057A (en) Image feature extraction model determining method and device and image identification method and device
CN114943995A (en) Training method of face recognition model, face recognition method and device
CN116361658A (en) Model training method, task processing method, device, electronic equipment and medium
CN113903071A (en) Face recognition method and device, electronic equipment and storage medium
CN113610856A (en) Method and device for training image segmentation model and image segmentation
CN113361535A (en) Image segmentation model training method, image segmentation method and related device
CN115482422B (en) Training method of deep learning model, image processing method and device
CN115496916B (en) Training method of image recognition model, image recognition method and related device
CN115578583B (en) Image processing method, device, electronic equipment and storage medium
CN114550236B (en) Training method, device, equipment and storage medium for image recognition and model thereof
CN113033415B (en) Data queue dynamic updating method and device, electronic equipment and storage medium
CN113362304B (en) Training method of definition prediction model and method for determining definition level

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination