CN113033557A - Method and device for training image processing model and detecting image - Google Patents

Method and device for training image processing model and detecting image

Info

Publication number
CN113033557A
Authority
CN
China
Prior art keywords
local
feature
features
trained
image processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110411530.5A
Other languages
Chinese (zh)
Inventor
杨馥魁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110411530.5A
Publication of CN113033557A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides a method and a device for training an image processing model and for detecting an image, and relates to the field of artificial intelligence, in particular to computer vision and deep learning. A specific implementation scheme is as follows: acquiring at least two first local features and at least two corresponding second local features, where the first local features and the second local features are obtained by performing feature extraction on a target region in a target image based on a pre-trained first image processing model and a corresponding second image processing model to be trained, respectively; inputting the at least two first local features and the corresponding second local features into corresponding local relationship extraction models to be trained, respectively, to generate a first local relationship feature and a second local relationship feature; and adjusting parameters of the second image processing model to be trained and of the local relationship extraction models based on a loss value generated from the first local relationship feature and the second local relationship feature, to obtain a trained second image processing model.

Description

Method and device for training image processing model and detecting image
Technical Field
The present disclosure relates to the field of artificial intelligence technology, and more particularly, to computer vision technology and deep learning technology.
Background
Knowledge Distillation (KD) is a common method of model compression. The idea is to transfer the knowledge of a large teacher network to a smaller student network, so that the performance of the student network approaches that of the teacher network.
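As a background illustration only (not the scheme of this disclosure), a classical logit-based distillation step can be sketched as follows; the function, tensor shapes, and temperature are hypothetical, and PyTorch is assumed:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """Generic soft-target distillation: the student is trained to mimic the
    teacher's softened output distribution via KL divergence."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # The conventional T^2 factor keeps gradient magnitudes comparable.
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature ** 2

# Toy usage: a batch of 8 samples and 10 classes.
teacher_out = torch.randn(8, 10)                          # frozen teacher network outputs
student_out = torch.randn(8, 10, requires_grad=True)      # student network outputs
loss = distillation_loss(student_out, teacher_out)
loss.backward()
```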
Disclosure of Invention
A method, apparatus, electronic device, and storage medium for training an image processing model and detecting an image are provided.
According to a first aspect, there is provided a method for training an image processing model, the method comprising: acquiring at least two first local features and at least two corresponding second local features, wherein the first local features are obtained by performing feature extraction on a first target region in a target image based on a pre-trained first image processing model, and the second local features are obtained by performing feature extraction on a second target region in the target image based on a second image processing model to be trained that corresponds to the first image processing model; inputting the at least two first local features and the at least two corresponding second local features into corresponding local relationship extraction models to be trained, respectively, to generate a first local relationship feature and a second local relationship feature, wherein the first local relationship feature is used for characterizing the association relationships between the at least two first local features, and the second local relationship feature is used for characterizing the association relationships between the at least two second local features; generating a loss value based on the first local relationship feature and the second local relationship feature by using a preset loss function; and adjusting network parameters of the second image processing model to be trained and of the local relationship extraction models to be trained based on the generated loss value, to obtain a trained second image processing model.
According to a second aspect, there is provided a method for detecting an image, the method comprising: acquiring an image to be detected; inputting an image to be detected into a pre-trained image processing model, and generating an object detection result corresponding to the image to be detected, wherein the image processing model is obtained by training according to the method described in any one of the implementation manners of the first aspect.
According to a third aspect, there is provided an apparatus for training an image processing model, the apparatus comprising: a local feature acquisition unit configured to acquire at least two first local features and at least two corresponding second local features, wherein the first local features are obtained by performing feature extraction on a first target region in a target image based on a pre-trained first image processing model, and the second local features are obtained by performing feature extraction on a second target region in the target image based on a second image processing model to be trained that corresponds to the first image processing model; a relationship feature generation unit configured to input the at least two first local features and the at least two corresponding second local features into corresponding local relationship extraction models to be trained, respectively, to generate a first local relationship feature and a second local relationship feature, wherein the first local relationship feature is used for characterizing the association relationships between the at least two first local features, and the second local relationship feature is used for characterizing the association relationships between the at least two second local features; a loss value generation unit configured to generate a loss value based on the first local relationship feature and the second local relationship feature by using a preset loss function; and a parameter adjustment unit configured to adjust network parameters of the second image processing model to be trained and of the local relationship extraction models to be trained based on the generated loss value, to obtain a trained second image processing model.
According to a fourth aspect, there is provided an apparatus for detecting an image, the apparatus comprising: an image acquisition unit configured to acquire an image to be detected; a detection unit configured to input an image to be detected to a pre-trained image processing model, and generate an object detection result corresponding to the image to be detected, wherein the image processing model is obtained by training according to the method described in any implementation manner of the first aspect.
According to a fifth aspect, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to perform the method as described in any one of the implementations of the first aspect and the second aspect.
According to a sixth aspect, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for enabling a computer to perform the method as described in an implementation form of any one of the first and second aspects.
According to a seventh aspect, a computer program product is provided, the computer program product comprising a computer program which, when executed by a processor, is capable of implementing the method as described in any one of the implementations of the first and second aspects.
According to the disclosed technique, the local relationship extraction model establishes association relationships among the features of objects presented in different regions of the target image, and the second image processing model is trained under supervision between the first local relationship feature corresponding to the first image processing model and the second local relationship feature corresponding to the second image processing model to be trained. In this way, the second image processing model, which serves as the student network in knowledge distillation, can better learn the capability of the first image processing model, serving as the teacher network, to understand the association relationships among local features in the target image. This overcomes the shortcoming of prior-art knowledge distillation applied to image detection tasks, namely the lack of capability to represent the association relationships among multiple targets in a detection task, and effectively improves the distillation effect on such tasks. It further improves the accuracy and efficiency of image detection by the image processing model serving as the student network, and provides a technical basis for lightweight deployment of high-performance, small-scale networks, so that the image processing model can be applied without being strictly limited by high-performance hardware.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. In the drawings:
FIG. 1a, FIG. 1b, FIG. 1c, FIG. 1d, FIG. 1e are schematic diagrams according to a first embodiment of the disclosure;
FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure;
FIG. 3 is a schematic diagram of one application scenario in which a method for training an image processing model of an embodiment of the present disclosure may be implemented;
FIG. 4 is a schematic diagram of an apparatus for training an image processing model according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of an apparatus for detecting an image according to an embodiment of the present disclosure;
fig. 6 is a block diagram of an electronic device for implementing a method for detecting an image according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1a is a schematic diagram 100 illustrating a first embodiment according to the present disclosure. The method for training the image processing model comprises the following steps:
s101, at least two first local features and at least two corresponding second local features are obtained.
In this embodiment, an executing body of the method for training an image processing model may acquire at least two first local features and corresponding at least two second local features in various ways. The first local feature may be obtained by performing feature extraction on a first target region in a target image based on a first image processing model trained in advance. The second local feature may be obtained by performing feature extraction on a second target region in the target image based on a second image processing model to be trained that corresponds to the first image processing model.
In this embodiment, as an example, the executing body may locally acquire the at least two first local features and the corresponding at least two second local features. As yet another example, the executing body may acquire the at least two first local features and the corresponding at least two second local features from a communicatively connected electronic device (e.g., an electronic device for feature extraction).
The first image processing model may be a complex model in knowledge distillation, and the second image processing model to be trained may be a simplified model in knowledge distillation corresponding to the first image processing model.
It should also be noted that the first target region and the second target region generally match each other. As an example, the first target region and the second target region may be the same region in the target image, or may be regions whose positions differ by less than a preset threshold. The numbers of the first local features and of the corresponding second local features generally correspond.
S102, inputting the at least two first local features and the at least two corresponding second local features into corresponding local relation extraction models to be trained respectively, and generating the first local relation features and the second local relation features.
In this embodiment, the executing body may respectively input the at least two first local features and the at least two corresponding second local features acquired in step S101 to corresponding local relationship extraction models to be trained in various ways, so as to generate the first local relationship features and the second local relationship features. Wherein the first local relational feature may be used to characterize an associative relationship between the at least two first local features. The second local relationship feature may be used to characterize an associative relationship between the at least two second local features.
In this embodiment, since the first local feature and the second local feature are obtained by feature extraction on matched regions of the target image based on the pre-trained first image processing model and the corresponding second image processing model to be trained, respectively, each of them may be used to characterize an object presented in a certain region of the target image. Thus, the first local relationship feature may be used to characterize the association relationships between the objects, presented in different regions of the target image, indicated by the at least two first local features. Similarly, the second local relationship feature may be used to characterize the association relationships between the objects, presented in different regions of the target image, indicated by the at least two second local features.
In this embodiment, the first image processing model and the second image processing model to be trained corresponding to the first image processing model may respectively correspond to the local relationship extraction model to be trained. Usually, the structures of the respective corresponding local relationship extraction models are identical. The initial values of the network parameters of the respective corresponding local relationship extraction models may be the same or different.
And S103, generating a loss value based on the first local relation characteristic and the second local relation characteristic by using a preset loss function.
In this embodiment, the executing body may generate a loss value based on the first local relationship feature and the second local relationship feature generated in step S102 by using a preset loss function. The loss function may include various loss functions that can be used to supervise the knowledge distillation process, such as an L2 loss function.
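As an illustration only, assuming the two relationship features are tensors of the same shape, the L2-type supervision mentioned above could be computed as a mean-squared error; the variable names and the 256-dimensional shape are assumptions:

```python
import torch
import torch.nn.functional as F

# Hypothetical relationship features produced in step S102; shapes are assumed.
first_relationship_feat = torch.randn(1, 256)                        # teacher branch
second_relationship_feat = torch.randn(1, 256, requires_grad=True)   # student branch

# One possible choice of the "preset loss function" of step S103: an L2/MSE loss
# between the two relationship features.
loss = F.mse_loss(second_relationship_feat, first_relationship_feat)
```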
And S104, adjusting network parameters of the second image processing model to be trained and the local relation extraction model to be trained based on the generated loss value to obtain the trained second image processing model.
In this embodiment, based on the generated loss value, the executing body may adjust the network parameters of the second image processing model to be trained and of the local relationship extraction models to be trained in various ways, so as to obtain the trained second image processing model. As an example, the executing body may adjust these network parameters by back propagation. After a training stop condition (for example, a preset number of iterations) is reached, the executing body may determine the parameter-adjusted second image processing model as the trained second image processing model.
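A minimal end-to-end sketch of steps S101 to S104 is given below, assuming PyTorch-style modules; the relation-extraction architecture, feature shapes, optimizer choice, and the decision to update both relation extractors together with the student backbone are illustrative assumptions rather than details fixed by this disclosure:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationExtractor(nn.Module):
    """Stand-in local relationship extraction model: projects a set of local
    features and pools them into a single relationship feature."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, local_feats):           # local_feats: (num_regions, dim)
        return self.proj(local_feats).mean(dim=0, keepdim=True)

teacher_relation = RelationExtractor()        # paired with the first (teacher) model
student_relation = RelationExtractor()        # paired with the second (student) model
student_backbone = nn.Linear(256, 256)        # stand-in for the second image processing model

optimizer = torch.optim.SGD(
    list(student_backbone.parameters())
    + list(teacher_relation.parameters())
    + list(student_relation.parameters()),
    lr=0.01)

# S101: local features from matched regions (random stand-ins here; the frozen
# teacher backbone would normally produce first_locals).
first_locals = torch.randn(4, 256)
second_locals = student_backbone(torch.randn(4, 256))

# S102: relationship features from the two relation-extraction models.
first_rel = teacher_relation(first_locals)
second_rel = student_relation(second_locals)

# S103: loss between the two relationship features (L2/MSE as one option).
loss = F.mse_loss(second_rel, first_rel)

# S104: back-propagate and adjust the student and relation-extraction parameters.
optimizer.zero_grad()
loss.backward()
optimizer.step()
```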
In the method provided by the foregoing embodiment of the present disclosure, the local relationship extraction model is used to establish association relationships among the features of objects presented in different regions of the target image, and the second image processing model is trained under supervision between the first local relationship feature corresponding to the first image processing model and the second local relationship feature corresponding to the second image processing model to be trained. In this way, the second image processing model, which serves as the student network in knowledge distillation, can better learn the capability of the first image processing model, serving as the teacher network, to understand the association relationships among local features in the target image. This overcomes the shortcoming of prior-art knowledge distillation applied to image detection tasks, namely the lack of capability to represent the association relationships among multiple targets in a detection task, and effectively improves the distillation effect on such tasks. It further improves the accuracy and efficiency of image detection by the second image processing model serving as the student network, and provides a technical basis for lightweight deployment of high-performance, small-scale networks, so that the image processing model can be applied without being strictly limited by high-performance hardware.
In some optional implementation manners of this embodiment, the executing body may input at least two first local features and at least two corresponding second local features to corresponding local relationship extraction models to be trained, respectively, according to the following steps, and generate the first local relationship feature and the second local relationship feature:
in a first step, the following generation steps are performed: selecting unselected first local feature pairs and second local feature pairs from the at least two first local features and the at least two second local features respectively; generating a first local sub-relational feature and a second local sub-relational feature corresponding to the selected first local feature pair and the second local feature pair respectively; it is determined whether there are unselected pairs of first and second local features, respectively, of the at least two first and second local features.
In these implementations, the executing body may execute the generating step described above. The generating step may include the following substeps:
substep 1, selecting unselected first local feature pairs from the at least two first local features obtained in the step S101; selecting unselected second local feature pairs from the at least two second local features obtained in the step S101;
a substep 2 of generating a first local sub-relationship feature corresponding to the selected first local feature pair; generating a second local sub-relationship feature corresponding to the selected second local feature pair;
a substep 3 of determining whether there exists an unselected first local feature pair in the at least two first local features acquired in the step S101; and determining whether there is an unselected second local feature pair in the at least two second local features acquired in the step S101.
In these implementations, the number of the first local sub-relationship features is generally not less than the number of the selected first local feature pairs. As an example, when the order within a pair is not considered, the number of first local sub-relationship features may coincide with the number of selected first local feature pairs. Alternatively, the number of first local sub-relationship features may be at most twice the number of selected first local feature pairs. As an example, when the order within a pair is considered (e.g., the influence of a person in the image on a horse differs from the influence of the horse on the person), one first local feature pair may correspond to two first local sub-relationship features. The same applies to the second local sub-relationship features.
In a second step, in response to determining that such unselected pairs exist, the generating step is performed again.
In these implementations, in response to determining that the result of substep 3 described above indicates that unselected pairs exist, the executing body may continue to perform the generating step including substeps 1 to 3 described above. Thus, the first local sub-relationship features generated in the present solution may be used to characterize the association relationship between any two of the acquired first local features. Similarly, the second local sub-relationship features generated in the present scheme may be used to characterize the association relationship between any two of the acquired second local features.
In a third step, a first local relationship feature is generated based on a combination of the generated first local sub-relationship features.
In these implementations, the executing body may generate the first local relationship feature in various ways based on a combination of the generated first local sub-relationship features. As an example, the executing body may perform a weighted summation of the generated first local sub-relationship features to generate the first local relationship feature. As yet another example, the executing body may concatenate the generated first local sub-relationship features to generate the first local relationship feature.
In a fourth step, a second local relationship feature is generated based on a combination of the generated second local sub-relationship features.
In these implementations, the executing body may generate the second local relationship feature in various ways based on a combination of the generated second local sub-relationship features. As an example, the executing body may perform a weighted summation of the generated second local sub-relationship features to generate the second local relationship feature. As yet another example, the executing body may concatenate the generated second local sub-relationship features to generate the second local relationship feature.
Based on the above optional implementation manner, the present scheme may execute the foregoing generating step in a loop, generating corresponding first local sub-relationship features and second local sub-relationship features for any two of the acquired first local features and any two of the acquired second local features, respectively, so as to generate the first local relationship feature and the second local relationship feature. In this way, the association relationships among different local features can be characterized more comprehensively and in finer detail, which provides a basis for improving the knowledge distillation effect.
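To make the pairwise looping concrete, the following sketch enumerates every unordered pair of local features, produces one sub-relationship feature per pair with a placeholder pair function, and concatenates the results; the pair function, dimensions, and combination choice are assumptions (for ordered pairs, itertools.permutations would be used instead):

```python
import itertools
import torch

def pair_relation(f_a, f_b):
    """Placeholder for the per-pair sub-relationship computation
    (see the weighted/attention-style variants described below)."""
    return f_a * f_b                       # element-wise product as a stand-in

def relation_feature(local_feats):
    """local_feats: (num_regions, dim). Loop over all pairs, build one
    sub-relationship feature per pair, then concatenate them."""
    subs = [pair_relation(local_feats[i], local_feats[j])
            for i, j in itertools.combinations(range(local_feats.shape[0]), 2)]
    return torch.cat(subs, dim=0)          # concatenation; a weighted sum is another option

feats = torch.randn(3, 256)                # e.g. three detected regions
rel = relation_feature(feats)              # 3 regions -> 3 pairs -> shape (768,)
```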
Optionally, based on the implementation described in the first step and referring to fig. 1b and fig. 1c, the network parameters of the local relationship extraction model to be trained may include a first weight. The first local feature pair and the second local feature pair may each include a first local feature and a second local feature (i.e., the two members of the pair). The executing body may generate the first local sub-relationship feature and the second local sub-relationship feature corresponding to the selected first local feature pair and second local feature pair, respectively, according to the following steps:
and S1, generating a first local preprocessing feature and a second local preprocessing feature based on the product of the first local feature in the selected first local feature pair and the first local feature in the selected second local feature pair and the first weight of the corresponding local relation extraction model to be trained respectively.
In these implementations, the first weight may be adjusted during the training process.
S2, generating a first preprocessed feature and a second preprocessed feature based on the dot product of the first local feature and the second local feature in the selected first local feature pair and the dot product of the first local feature and the second local feature in the selected second local feature pair, respectively.
S3, based on the matrix product of the first local preprocessed feature and the first preprocessed feature and the matrix product of the second local preprocessed feature and the second preprocessed feature, generating a first local sub-relational feature corresponding to the selected first local feature pair and a second local sub-relational feature corresponding to the selected second local feature pair, respectively.
Based on the above optional implementation manner, the scheme provides a feature generation manner considering the influence of an object (such as a horse in an image) indicated by one local feature on an object (such as a person riding a horse in an image) indicated by another local feature, so that the mutual influence relationship among different local features can be more comprehensively and finely characterized, and a basis is provided for improving the knowledge distillation effect.
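A sketch of one plausible reading of steps S1 to S3 for a single feature pair, under the assumptions that each local feature is laid out as an n x d matrix of region positions by channels, that the first weight is a d x d matrix, and that the final matrix product is taken in the order shown; none of these details are fixed by the text above:

```python
import torch

def sub_relation_feature(f_a, f_b, w1):
    """f_a, f_b: (n, d) local features of the two regions in a pair (assumed layout).
    w1: (d, d) "first weight" of the local relationship extraction model."""
    # S1: local preprocessed feature = product of the first local feature and the first weight.
    pre = f_a @ w1                            # (n, d)
    # S2: preprocessed feature = dot products between the two local features.
    scores = f_a @ f_b.transpose(0, 1)        # (n, n)
    # S3: matrix product of the two intermediate results (order is an assumption).
    return scores @ pre                       # (n, d) sub-relationship feature

f_a, f_b = torch.randn(49, 256), torch.randn(49, 256)   # e.g. two 7x7 pooled regions
w1 = torch.randn(256, 256, requires_grad=True)           # trainable first weight
sub = sub_relation_feature(f_a, f_b, w1)
```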
Optionally, based on the implementation described in step S2 and referring to fig. 1d and fig. 1e, the network parameters of the local relationship extraction model to be trained may further include a second weight and a third weight. Based on the dot product of the first local feature and the second local feature in the selected first local feature pair and the dot product of the first local feature and the second local feature in the selected second local feature pair, the executing body may generate the first preprocessed feature and the second preprocessed feature, respectively, as follows:
and S21, generating a first front sub-feature and a second front sub-feature based on the product of the first local feature in the selected first local feature pair and the first local feature in the selected second local feature pair and the second weight of the corresponding local relation extraction model to be trained respectively.
S22, generating a first rear sub-feature and a second rear sub-feature based on the product of the second local feature in the selected first local feature pair with the third weight of the corresponding local relationship extraction model to be trained, and the product of the second local feature in the selected second local feature pair with the third weight of its corresponding local relationship extraction model to be trained, respectively.
In these implementations, the second weight and the third weight may be adjusted during the training process.
S23, inputting the dot product of the first front sub-feature and the first rear sub-feature and the dot product of the second front sub-feature and the second rear sub-feature into a normalized exponential function (softmax), respectively, to generate the first preprocessed feature and the second preprocessed feature.
Based on the above optional implementation manner, the present scheme provides another way of generating features that takes into account the influence of an object indicated by one local feature (such as a horse in an image) on an object indicated by another local feature (such as the person riding the horse), thereby enriching the ways of characterizing the mutual influence among different local features and providing a basis for improving the knowledge distillation effect.
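The refinement in steps S21 to S23 replaces the raw dot product of the previous sketch with projected features and a softmax normalization; a sketch under the same layout assumptions as before (the softmax axis is also an assumption):

```python
import torch

def preprocessed_feature(f_a, f_b, w2, w3):
    """Steps S21-S23: project the two local features with the second and third
    weights, take their dot products, and normalize with a softmax."""
    front = f_a @ w2                          # S21: front sub-feature, (n, d)
    rear = f_b @ w3                           # S22: rear sub-feature, (n, d)
    scores = front @ rear.transpose(0, 1)     # dot products between positions, (n, n)
    return torch.softmax(scores, dim=-1)      # S23: normalized exponential function

f_a, f_b = torch.randn(49, 256), torch.randn(49, 256)
w2 = torch.randn(256, 256, requires_grad=True)            # trainable second weight
w3 = torch.randn(256, 256, requires_grad=True)            # trainable third weight
attn = preprocessed_feature(f_a, f_b, w2, w3)              # replaces `scores` in the previous sketch
```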
In some optional implementations of the embodiment, the executing body may obtain the at least two first local features and the corresponding at least two second local features according to the following steps:
the method comprises the following steps of firstly, obtaining characteristic graphs of a first image processing model and a second image processing model to be trained respectively aiming at a target image.
In these implementations, the executing body may acquire feature maps of the first image processing model and the second image processing model to be trained respectively for the target image in various ways. As an example, the executing body may input the target image into the first image processing model and the second image processing model to be trained, respectively, to obtain feature maps of the first image processing model and the second image processing model to be trained, respectively, for the target image.
In a second step, coordinate information of at least two candidate boxes corresponding to the feature map of the second image processing model to be trained is generated using a bounding-box regression technique.
In these implementations, the candidate boxes described above are typically used to indicate the locations of detected targets. The targets may be of the same kind (e.g., cats) or of different kinds (e.g., horses, cattle, and people).
In a third step, at least two masks are generated according to the generated coordinate information.
In these implementations, the masks may correspond to the coordinate information of the candidate boxes. As an example, the generated masks may correspond to the candidate boxes one to one.
In a fourth step, at least two first local features and corresponding at least two second local features are extracted from the acquired feature maps, respectively, according to the generated masks.
In these implementations, using the masks generated in the third step, the executing body may extract at least two first local features and at least two corresponding second local features from the acquired feature map of the first image processing model for the target image and the acquired feature map of the second image processing model to be trained for the target image, respectively. As an example, each mask may correspond to one extracted first local feature and one extracted second local feature.
Based on this optional implementation manner, the present scheme extracts the first local features and the second local features from the feature maps of the first image processing model and the second image processing model, respectively, by using the coordinate information of the candidate boxes generated from the second image processing model to be trained. This enriches the ways in which the first local features and the second local features can be generated, and ties the loss value generated from the first local relationship feature and the second local relationship feature to the coordinate information of the candidate boxes, so that the training of the second image processing model serving as the student network is more targeted. This helps to improve the knowledge distillation effect and the detection accuracy and speed of the image processing model.
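An illustrative sketch of the four steps above: build binary masks on the feature-map grid from (hypothetical) candidate-box coordinates, then pool masked local features from both feature maps; the box coordinates, map sizes, and average pooling are assumptions:

```python
import torch

def boxes_to_masks(boxes, height, width):
    """Turn candidate-box coordinates (x1, y1, x2, y2), given on the feature-map
    grid, into binary masks, one per box (third step above)."""
    masks = torch.zeros(len(boxes), height, width)
    for k, (x1, y1, x2, y2) in enumerate(boxes):
        masks[k, y1:y2, x1:x2] = 1.0
    return masks

def masked_local_features(feature_map, masks):
    """feature_map: (c, h, w). Average-pool the features inside each mask,
    yielding one local feature vector per candidate box (fourth step above)."""
    c = feature_map.shape[0]
    flat = feature_map.reshape(c, -1)                     # (c, h*w)
    m = masks.reshape(masks.shape[0], -1)                 # (num_boxes, h*w)
    return (m @ flat.t()) / m.sum(dim=1, keepdim=True).clamp(min=1.0)

teacher_map = torch.randn(256, 32, 32)      # first step: feature map of the first model
student_map = torch.randn(256, 32, 32)      # first step: feature map of the second model
boxes = [(2, 3, 10, 12), (15, 5, 28, 20)]   # second step: candidate boxes (made-up coordinates)
masks = boxes_to_masks(boxes, 32, 32)       # third step
first_locals = masked_local_features(teacher_map, masks)    # fourth step: (2, 256)
second_locals = masked_local_features(student_map, masks)   # fourth step: (2, 256)
```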
With continued reference to fig. 2, fig. 2 is a schematic diagram 200 according to a second embodiment of the present disclosure. The method for detecting an image includes the steps of:
s201, acquiring an image to be detected.
In the present embodiment, an executing body of the method for detecting an image may acquire the image to be detected in various ways. The image to be detected may generally contain at least two detectable objects. The detectable objects may be of the same kind (e.g., several horses) or of different kinds (e.g., a person and a horse), which is not limited here.
In this embodiment, as an example, the executing body may acquire the image to be detected locally or from a communicatively connected electronic device.
S202, inputting the image to be detected into a pre-trained image processing model, and generating an object detection result corresponding to the image to be detected.
In this embodiment, the executing body may input the image to be detected acquired in step S201 into a pre-trained image processing model, and generate an object detection result corresponding to the image to be detected. The image processing model may be obtained by training with a method as described in any implementation manner of the foregoing embodiments. The object detection result may be used to indicate the detectable objects included in the image to be detected. As an example, the object detection result may be "a person riding a horse".
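For completeness, applying the trained second image processing model at inference time reduces to a single forward pass; the stand-in model, input size, and output format below are illustrative assumptions, not the actual detector of this disclosure:

```python
import torch
import torch.nn as nn

# Stand-in for the trained second image processing model obtained after distillation;
# in practice it would be loaded from a checkpoint (e.g., with torch.load).
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1),
                      nn.AdaptiveAvgPool2d(1),
                      nn.Flatten(),
                      nn.Linear(8, 4))
model.eval()

image = torch.randn(1, 3, 640, 640)          # stand-in for a preprocessed image to be detected
with torch.no_grad():
    detections = model(image)                # object detection result for the image to be detected
```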
As can be seen from fig. 2, the flow 200 of the method for detecting an image in the present embodiment embodies the step of detecting objects in an image with a pre-trained image processing model. The scheme described in this embodiment can therefore compress a complex network into a network with a more compact structure while maintaining good image detection performance, which provides a technical basis for deploying the model in applications and for improving detection speed, and further improves the accuracy and speed of object detection in images.
With continued reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of a method for training an image processing model according to an embodiment of the present disclosure. In the application scenario of fig. 3, a technician may input at least two first local features and corresponding at least two second local features 301 into a server (not shown in the figure). The at least two first local features and the at least two corresponding second local features 301 may be obtained by feature extraction on the horse-riding image 3013 based on the first image processing model 3011 and the second image processing model 3012, respectively. The server may input the at least two first local features and the at least two corresponding second local features 301 into corresponding local relationship extraction models 3021 and 3022 to be trained, respectively, to generate a "person & horse relationship feature" 3031 as the first local relationship feature and a "person & horse relationship feature" 3032 as the second local relationship feature. Using the preset loss function 304, the server may adjust the network parameters of the second image processing model 3012 and of the local relationship extraction models 3021, 3022 to be trained, based on the loss value generated from the first and second local relationship features 3031, 3032.
At present, an existing approach generally extracts and sorts features of an image directly and uses them as the basis for knowledge distillation, so conventional distillation algorithms are often ineffective on the distillation problem of image detection tasks. In the method provided by the above embodiment of the present disclosure, the local relationship extraction model establishes association relationships among the features of objects presented in different regions of the target image, and the second image processing model is trained under supervision between the first local relationship feature corresponding to the first image processing model and the second local relationship feature corresponding to the second image processing model to be trained. In this way, the second image processing model, which serves as the student network in knowledge distillation, can better learn the capability of the first image processing model, serving as the teacher network, to understand the association relationships among local features in the target image. This overcomes the shortcoming of prior-art knowledge distillation applied to image detection tasks, namely the lack of capability to represent the association relationships among multiple targets in a detection task, and effectively improves the distillation effect on such tasks. It further improves the accuracy and efficiency of image detection by the image processing model serving as the student network, and provides a technical basis for lightweight deployment of high-performance, small-scale networks, so that the image processing model can be applied without being strictly limited by high-performance hardware.
With further reference to fig. 4, as an implementation of the methods shown in the above figures, the present disclosure provides an embodiment of an apparatus for training an image processing model, which corresponds to the method embodiment shown in fig. 1, and which is particularly applicable in various electronic devices.
As shown in fig. 4, the apparatus 400 for training an image processing model according to the present embodiment includes a local feature acquisition unit 401, a relationship feature generation unit 402, a loss value generation unit 403, and a parameter adjustment unit 404. The local feature acquisition unit 401 is configured to acquire at least two first local features and at least two corresponding second local features, where the first local features are obtained by performing feature extraction on a first target region in a target image based on a pre-trained first image processing model, and the second local features are obtained by performing feature extraction on a second target region in the target image based on a second image processing model to be trained that corresponds to the first image processing model. The relationship feature generation unit 402 is configured to input the at least two first local features and the corresponding at least two second local features into corresponding local relationship extraction models to be trained, respectively, to generate a first local relationship feature and a second local relationship feature, where the first local relationship feature is used for characterizing the association relationships between the at least two first local features, and the second local relationship feature is used for characterizing the association relationships between the at least two second local features. The loss value generation unit 403 is configured to generate a loss value based on the first local relationship feature and the second local relationship feature using a preset loss function. The parameter adjustment unit 404 is configured to adjust network parameters of the second image processing model to be trained and of the local relationship extraction models to be trained based on the generated loss value, to obtain a trained second image processing model.
In the present embodiment, for the specific processing of the local feature acquisition unit 401, the relationship feature generation unit 402, the loss value generation unit 403, and the parameter adjustment unit 404 of the apparatus 400 for training an image processing model, and for their technical effects, reference may be made to the related descriptions of steps S101, S102, S103, and S104 in the embodiment corresponding to fig. 1, which are not repeated here.
In some optional implementations of this embodiment, the relationship feature generation unit 402 may include: a generating subunit (not shown in the figures) configured to perform the following generating step: selecting unselected first local feature pairs and second local feature pairs from the at least two first local features and the at least two second local features, respectively; generating a first local sub-relationship feature and a second local sub-relationship feature corresponding to the selected first local feature pair and second local feature pair, respectively; and determining whether there are unselected first local feature pairs and second local feature pairs in the at least two first local features and the at least two second local features, respectively; a loop subunit (not shown in the figures) configured to continue to perform the generating step in response to determining that such pairs exist; a first relationship feature generation subunit (not shown in the figures) configured to generate the first local relationship feature based on a combination of the generated first local sub-relationship features; and a second relationship feature generation subunit (not shown in the figures) configured to generate the second local relationship feature based on a combination of the generated second local sub-relationship features.
In some optional implementations of the present embodiment, the network parameters of the local relationship extraction model to be trained may include a first weight. The first local feature pair and the second local feature pair may each include a first local feature and a second local feature. The generating subunit may include: a first generation module (not shown in the figure) configured to generate a first local preprocessed feature and a second local preprocessed feature based on the products of the first local feature in the selected first local feature pair and the first local feature in the selected second local feature pair with the first weights of the corresponding local relationship extraction models to be trained, respectively; a second generation module (not shown in the figure) configured to generate a first preprocessed feature and a second preprocessed feature based on the dot product of the first local feature and the second local feature in the selected first local feature pair and the dot product of the first local feature and the second local feature in the selected second local feature pair, respectively; and a third generation module (not shown in the figure) configured to generate a first local sub-relationship feature corresponding to the selected first local feature pair and a second local sub-relationship feature corresponding to the selected second local feature pair, respectively, based on the matrix product of the first local preprocessed feature and the first preprocessed feature and the matrix product of the second local preprocessed feature and the second preprocessed feature.
In some optional implementations of this embodiment, the network parameters of the local relationship extraction model to be trained may further include a second weight and a third weight. The second generation module may be further configured to: generate a first front sub-feature and a second front sub-feature based on the products of the first local feature in the selected first local feature pair and the first local feature in the selected second local feature pair with the second weights of the corresponding local relationship extraction models to be trained, respectively; generate a first rear sub-feature and a second rear sub-feature based on the products of the second local feature in the selected first local feature pair and the second local feature in the selected second local feature pair with the third weights of the corresponding local relationship extraction models to be trained, respectively; and input the dot product of the first front sub-feature and the first rear sub-feature and the dot product of the second front sub-feature and the second rear sub-feature into a normalized exponential function, respectively, to generate the first preprocessed feature and the second preprocessed feature.
In some optional implementations of the present embodiment, the local feature acquisition unit 401 may be further configured to: acquire feature maps produced by the first image processing model and by the second image processing model to be trained, respectively, for the target image; generate coordinate information of at least two candidate boxes corresponding to the feature map of the second image processing model to be trained by using a bounding-box regression technique; generate at least two masks according to the generated coordinate information; and extract at least two first local features and corresponding at least two second local features from the acquired feature maps, respectively, according to the generated masks.
In the apparatus provided by the above embodiment of the present disclosure, the relationship feature generation unit 402 uses the local relationship extraction model to establish association relationships among the features, acquired by the local feature acquisition unit 401, that indicate objects presented in different regions of the target image, and the parameter adjustment unit 404 trains the second image processing model using the loss value generated by the loss value generation unit 403 to indicate the difference between the first local relationship feature corresponding to the first image processing model and the second local relationship feature corresponding to the second image processing model to be trained. In this way, the second image processing model, which serves as the student network in knowledge distillation, can better learn the capability of the first image processing model, serving as the teacher network, to understand the association relationships among local features in the target image. This overcomes the shortcoming of prior-art knowledge distillation applied to image detection tasks, namely the lack of capability to represent the association relationships among multiple targets in a detection task, and effectively improves the distillation effect on such tasks. It further improves the accuracy and efficiency of image detection by the image processing model serving as the student network, and provides a technical basis for lightweight deployment of high-performance, small-scale networks, so that the image processing model can be applied without being strictly limited by high-performance hardware.
With further reference to fig. 5, as an implementation of the methods shown in the above figures, the present disclosure provides an embodiment of an apparatus for detecting an image, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable in various electronic devices.
As shown in fig. 5, the apparatus 500 for detecting an image provided by the present embodiment includes an image acquisition unit 501 and a detection unit 502. The image acquisition unit 501 is configured to acquire an image to be detected; the detection unit 502 is configured to input the image to be detected into a pre-trained image processing model and generate an object detection result corresponding to the image to be detected, where the image processing model is obtained by training with the method for training an image processing model as described in any implementation manner of the foregoing embodiments.
In the present embodiment, for the specific processing of the image acquisition unit 501 and the detection unit 502 of the apparatus 500 for detecting an image, and for their technical effects, reference may be made to the related descriptions of steps S201 and S202 in the embodiment corresponding to fig. 2, which are not repeated here.
The apparatus provided by the above embodiment of the present disclosure detects objects in an image using a pre-trained image processing model serving as the student network, so that a complex network can be compressed into a network with a more compact structure while good image detection performance is maintained, which provides a technical basis for deploying the model in applications and for improving detection speed. It further improves the accuracy and efficiency of image detection by the image processing model serving as the student network, and provides a technical basis for lightweight deployment of high-performance, small-scale networks, so that the image processing model can be applied without being strictly limited by high-performance hardware.
In the technical solution of the present disclosure, the acquisition, storage, and application of the personal information of users involved all comply with the provisions of relevant laws and regulations, necessary confidentiality measures are taken, and public order and good customs are not violated.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 6 illustrates a schematic block diagram of an example electronic device 600 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown here, their connections and relationships, and their functions are meant to be examples only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the device 600 includes a computing unit 601, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 602 or a computer program loaded from a storage unit 608 into a random access memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 can also be stored. The computing unit 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
A number of components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, a mouse, or the like; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 601 performs the respective methods and processes described above, such as the method for training an image processing model or the method for detecting an image. For example, in some embodiments, the method for training an image processing model or the method for detecting an image may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the method for training an image processing model or the method for detecting an image described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured by any other suitable means (e.g., by means of firmware) to perform the method for training an image processing model or the method for detecting an image.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on chips (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, which is not limited herein, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (11)

1. A method for training an image processing model, comprising:
obtaining at least two first local features and at least two corresponding second local features, wherein the first local features are obtained by performing feature extraction on a first target region in a target image based on a pre-trained first image processing model, and the second local features are obtained by performing feature extraction on a second target region in the target image based on a second image processing model to be trained corresponding to the first image processing model;
inputting the at least two first local features and the corresponding at least two second local features into corresponding local relationship extraction models to be trained respectively, and generating first local relationship features and second local relationship features, wherein the first local relationship features are used for representing the association relationship between the at least two first local features, and the second local relationship features are used for representing the association relationship between the at least two second local features;
generating a loss value based on the first local relational feature and the second local relational feature by using a preset loss function;
and adjusting network parameters of the second image processing model to be trained and the local relation extraction model to be trained based on the generated loss value to obtain a trained second image processing model.
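By way of illustration only, a minimal PyTorch-style sketch of the training step of claim 1 is given below; the function and argument names (training_step, relation_extractor_t, relation_extractor_s, optimizer) and the choice of mean squared error as the preset loss function are assumptions of this sketch, not part of the claim.

```python
import torch.nn.functional as F

def training_step(teacher_locals, student_locals,
                  relation_extractor_t, relation_extractor_s, optimizer):
    """One training step in the spirit of claim 1."""
    # Relation features characterizing the association among the local features
    # of the first (pre-trained) model and of the second model to be trained.
    first_relation = relation_extractor_t(teacher_locals)
    second_relation = relation_extractor_s(student_locals)

    # Preset loss function; mean squared error is an assumption, the claim only
    # requires "a preset loss function" over the two relation features.
    loss = F.mse_loss(second_relation, first_relation)

    # Adjust the network parameters of the second model and of the relation
    # extraction models (the optimizer is assumed to hold both parameter sets).
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```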
2. The method of claim 1, wherein the inputting the at least two first local features and the corresponding at least two second local features into corresponding local relationship extraction models to be trained, respectively, and generating the first local relationship features and the second local relationship features comprises:
the following generation steps are performed: selecting unselected first and second local feature pairs from the at least two first and second local features, respectively; generating a first local sub-relational feature and a second local sub-relational feature corresponding to the selected first local feature pair and the second local feature pair respectively; determining whether there are unselected first and second local feature pairs of the at least two first and second local features, respectively;
in response to determining that there is, continuing to perform the generating step;
generating the first local relational feature based on a combination of the generated first local sub-relational features;
generating the second local relational feature based on a combination of the generated second local sub-relational features.
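A minimal sketch of the pairwise procedure of claim 2, assuming Python/PyTorch; the hypothetical pair_relation callable stands for the per-pair computation of claims 3-4, iterating over all pairs replaces the select-until-none-remain loop, and concatenation is only one possible combination.

```python
from itertools import combinations
import torch

def build_relation_feature(local_feats, pair_relation):
    """Combine per-pair sub-relation features into one relation feature (claim 2)."""
    sub_relations = []
    # Repeatedly select a not-yet-selected pair of local features until none remain.
    for feat_a, feat_b in combinations(local_feats, 2):
        sub_relations.append(pair_relation(feat_a, feat_b))
    # Combine the generated sub-relation features; concatenation is an assumption,
    # the claim only requires "a combination" of the sub-relation features.
    return torch.cat(sub_relations, dim=-1)
```

The same function would be applied once to the first model's local features and once to the second model's, yielding the first and second local relation features respectively.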
3. The method of claim 2, wherein the network parameters of the local relationship extraction model to be trained comprise first weights, and the first local feature pair and the second local feature pair each comprise a first local feature and a second local feature; and
the generating a first local sub-relational feature and a second local sub-relational feature corresponding to the selected first local feature pair and the second local feature pair respectively includes:
generating a first local preprocessing feature and a second local preprocessing feature based on the products of the first local feature in the selected first local feature pair and of the first local feature in the selected second local feature pair, respectively, with the first weight of the corresponding local relation extraction model to be trained;
generating a first preprocessing feature and a second preprocessing feature respectively based on the dot product of the first local feature and the second local feature in the selected first local feature pair and the dot product of the first local feature and the second local feature in the selected second local feature pair;
and generating a first local sub-relation feature corresponding to the selected first local feature pair and a second local sub-relation feature corresponding to the selected second local feature pair based on the matrix product of the first local preprocessing feature and the first preprocessing feature, and the matrix product of the second local preprocessing feature and the second preprocessing feature, respectively.
4. The method of claim 3, wherein the network parameters of the local relationship extraction model to be trained further comprise a second weight and a third weight; and
the generating a first preprocessing feature and a second preprocessing feature respectively based on a dot product of the first local feature and the second local feature in the selected first local feature pair and a dot product of the first local feature and the second local feature in the selected second local feature pair includes:
generating a first front sub-feature and a second front sub-feature based on the products of the first local feature in the selected first local feature pair and of the first local feature in the selected second local feature pair, respectively, with the second weight of the corresponding local relation extraction model to be trained;
generating a first back sub-feature and a second back sub-feature based on the products of the second local feature in the selected first local feature pair and of the second local feature in the selected second local feature pair, respectively, with the third weight of the corresponding local relation extraction model to be trained;
and inputting the dot product of the first front sub-feature and the first back sub-feature and the dot product of the second front sub-feature and the second back sub-feature, respectively, into a normalized exponential function to generate the first preprocessing feature and the second preprocessing feature.
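Read together, claims 3 and 4 describe an attention-like computation per selected pair. A minimal sketch, assuming each local feature is a matrix of shape (n, d), that the first, second, and third weights are realized as linear layers, and that the normalized exponential function is a softmax; the multiplication order in the final matrix product is likewise an assumption.

```python
import torch
import torch.nn as nn

class PairRelation(nn.Module):
    """Per-pair sub-relation feature in the spirit of claims 3-4."""

    def __init__(self, dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, dim, bias=False)  # first weight
        self.w2 = nn.Linear(dim, dim, bias=False)  # second weight
        self.w3 = nn.Linear(dim, dim, bias=False)  # third weight

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        # feat_a, feat_b: first and second local features of the selected pair, shape (n, d).
        local_pre = self.w1(feat_a)   # local preprocessing feature (claim 3)
        front = self.w2(feat_a)       # front sub-feature (claim 4)
        back = self.w3(feat_b)        # back sub-feature (claim 4)
        # Preprocessing feature: normalized exponential of the front/back dot products.
        attn = torch.softmax(front @ back.transpose(-1, -2), dim=-1)
        # Sub-relation feature: matrix product of the preprocessing feature and the
        # local preprocessing feature (ordering is an assumption of this sketch).
        return attn @ local_pre
```

In this reading the front and back sub-features play the role of queries and keys, the local preprocessing feature plays the role of values, and the sub-relation feature is the resulting attention response between the two local regions.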
5. The method according to one of claims 1 to 4, wherein the obtaining at least two first local features and corresponding at least two second local features comprises:
acquiring feature maps generated for the target image by the first image processing model and by the second image processing model to be trained, respectively;
generating, by using a bounding box regression technique, coordinate information of at least two candidate boxes corresponding to the feature map of the second image processing model to be trained;
generating at least two masks according to the generated coordinate information;
and extracting, according to the generated masks, the at least two first local features and the corresponding at least two second local features from the acquired feature maps, respectively.
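A minimal sketch of the extraction in claim 5, assuming feature maps of shape (C, H, W), candidate boxes given as integer coordinates on the feature map, and masked average pooling to turn each masked region into a feature vector; the pooling choice is an assumption of this sketch.

```python
import torch

def extract_local_features(teacher_map, student_map, boxes):
    """Claim 5: mask-based extraction of corresponding local features.

    teacher_map, student_map: feature maps of shape (C, H, W) for the same target image,
        from the first model and from the second model to be trained, respectively.
    boxes: iterable of candidate boxes (x1, y1, x2, y2) from bounding box regression.
    """
    c, h, w = student_map.shape
    teacher_locals, student_locals = [], []
    for x1, y1, x2, y2 in boxes:
        # Generate a mask from the box coordinates.
        mask = torch.zeros(1, h, w, dtype=student_map.dtype, device=student_map.device)
        mask[:, y1:y2, x1:x2] = 1.0
        # Extract the local features from both feature maps with the same mask
        # (masked average pooling over the region).
        area = mask.sum().clamp(min=1.0)
        teacher_locals.append((teacher_map * mask).sum(dim=(1, 2)) / area)
        student_locals.append((student_map * mask).sum(dim=(1, 2)) / area)
    return teacher_locals, student_locals
```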
6. A method for detecting an image, comprising:
acquiring an image to be detected;
inputting the image to be detected into a pre-trained image processing model, and generating an object detection result corresponding to the image to be detected, wherein the image processing model is obtained by training according to the method of any one of claims 1 to 5.
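Inference according to claim 6 reduces to a single forward pass through the trained second image processing model; in the sketch below the output format of the detector is left to the concrete model and is not prescribed by the claim.

```python
import torch

def detect(image: torch.Tensor, model: torch.nn.Module):
    """Claim 6: run the image to be detected through the trained image processing model."""
    model.eval()
    with torch.no_grad():
        # The model is the second image processing model obtained by the training
        # method of claims 1-5; its output (boxes, scores, labels, ...) depends on
        # the concrete detector architecture.
        detection_result = model(image.unsqueeze(0))  # add a batch dimension
    return detection_result
```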
7. An apparatus for training an image processing model, comprising:
a local feature acquisition unit configured to acquire at least two first local features and at least two corresponding second local features, wherein the first local features are obtained by performing feature extraction on a first target region in a target image based on a pre-trained first image processing model, and the second local features are obtained by performing feature extraction on a second target region in the target image based on a second image processing model to be trained corresponding to the first image processing model;
a relation feature generation unit, configured to input the at least two first local features and the corresponding at least two second local features to corresponding local relation extraction models to be trained, respectively, and generate first local relation features and second local relation features, where the first local relation features are used for characterizing a relation between the at least two first local features, and the second local relation features are used for characterizing a relation between the at least two second local features;
a loss value generation unit configured to generate a loss value based on the first and second local relational features using a preset loss function;
and a parameter adjusting unit configured to adjust network parameters of the second image processing model to be trained and the local relation extraction model to be trained based on the generated loss value, so as to obtain a trained second image processing model.
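Viewed as code, the apparatus of claim 7 packages the steps of claim 1 into units; the following schematic composition uses illustrative class and attribute names and the same assumed mean-squared-error loss as the earlier sketch.

```python
import torch.nn.functional as F

class TrainingApparatus:
    """Schematic composition of the units named in claim 7."""

    def __init__(self, local_feature_acquirer, relation_extractor, optimizer):
        self.acquire_local_features = local_feature_acquirer  # local feature acquisition unit
        self.extract_relation = relation_extractor            # relation feature generation unit
        self.optimizer = optimizer                             # used by the parameter adjusting unit

    def step(self, target_image):
        teacher_locals, student_locals = self.acquire_local_features(target_image)
        first_rel = self.extract_relation(teacher_locals)
        second_rel = self.extract_relation(student_locals)
        loss = F.mse_loss(second_rel, first_rel)               # loss value generation unit
        self.optimizer.zero_grad()                             # parameter adjusting unit
        loss.backward()
        self.optimizer.step()
        return loss
```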
8. An apparatus for detecting an image, comprising:
an image acquisition unit configured to acquire an image to be detected;
a detection unit configured to input the image to be detected to a pre-trained image processing model, and generate an object detection result corresponding to the image to be detected, wherein the image processing model is trained by the method according to any one of claims 1 to 5.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
10. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-6.
11. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-6.
CN202110411530.5A 2021-04-16 2021-04-16 Method and device for training image processing model and detecting image Withdrawn CN113033557A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110411530.5A CN113033557A (en) 2021-04-16 2021-04-16 Method and device for training image processing model and detecting image

Publications (1)

Publication Number Publication Date
CN113033557A true CN113033557A (en) 2021-06-25

Family

ID=76457293

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110411530.5A Withdrawn CN113033557A (en) 2021-04-16 2021-04-16 Method and device for training image processing model and detecting image

Country Status (1)

Country Link
CN (1) CN113033557A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190108411A1 (en) * 2017-10-11 2019-04-11 Alibaba Group Holding Limited Image processing method and processing device
CN110580482A (en) * 2017-11-30 2019-12-17 腾讯科技(深圳)有限公司 Image classification model training, image classification and personalized recommendation method and device
CN108171212A (en) * 2018-01-19 2018-06-15 百度在线网络技术(北京)有限公司 For detecting the method and apparatus of target
US20200250461A1 (en) * 2018-01-30 2020-08-06 Huawei Technologies Co., Ltd. Target detection method, apparatus, and system
WO2020006961A1 (en) * 2018-07-03 2020-01-09 北京字节跳动网络技术有限公司 Image extraction method and device
CN109886282A (en) * 2019-02-26 2019-06-14 腾讯科技(深圳)有限公司 Method for checking object, device, computer readable storage medium and computer equipment
WO2021056705A1 (en) * 2019-09-23 2021-04-01 平安科技(深圳)有限公司 Method for detecting damage to outside of human body on basis of semantic segmentation network, and related device
CN111860670A (en) * 2020-07-28 2020-10-30 平安科技(深圳)有限公司 Domain adaptive model training method, image detection method, device, equipment and medium
CN112529018A (en) * 2020-12-22 2021-03-19 北京百度网讯科技有限公司 Training method and device for local features of image and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YAO Yi; WANG Shike; CHEN Xihao; LIN Yupian: "Research on structured image annotation based on deep learning", Computer Knowledge and Technology (电脑知识与技术), no. 33 *
MAO Xueyu; PENG Yanbing: "Landmark recognition with incremental angular domain loss and multi-feature fusion", Journal of Image and Graphics (中国图象图形学报), no. 08 *

Similar Documents

Publication Publication Date Title
CN112966742A (en) Model training method, target detection method and device and electronic equipment
CN113642431A (en) Training method and device of target detection model, electronic equipment and storage medium
CN113343826A (en) Training method of human face living body detection model, human face living body detection method and device
CN113902956B (en) Training method of fusion model, image fusion method, device, equipment and medium
CN113627361B (en) Training method and device for face recognition model and computer program product
CN113221771A (en) Living body face recognition method, living body face recognition device, living body face recognition equipment, storage medium and program product
CN113221104A (en) User abnormal behavior detection method and user behavior reconstruction model training method
EP3955217A2 (en) Human behavior recognition method, apparatus, storage medium and program product
CN112861885A (en) Image recognition method and device, electronic equipment and storage medium
CN113947188A (en) Training method of target detection network and vehicle detection method
CN112580666A (en) Image feature extraction method, training method, device, electronic equipment and medium
CN115147831A (en) Training method and device of three-dimensional target detection model
CN114186681A (en) Method, apparatus and computer program product for generating model clusters
CN114220163B (en) Human body posture estimation method and device, electronic equipment and storage medium
CN118135633A (en) Book non-sensing borrowing and returning method, device, equipment and storage medium
CN114078274A (en) Face image detection method and device, electronic equipment and storage medium
CN114140320A (en) Image migration method and training method and device of image migration model
CN113643260A (en) Method, apparatus, device, medium and product for detecting image quality
CN113344064A (en) Event processing method and device
CN109600627B (en) Video identification method and device
CN116340777A (en) Training method of log classification model, log classification method and device
CN113033557A (en) Method and device for training image processing model and detecting image
CN115249281A (en) Image occlusion and model training method, device, equipment and storage medium
CN114093006A (en) Training method, device and equipment of living human face detection model and storage medium
CN115641481A (en) Method and device for training image processing model and image processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210625