CN112132146A - Training method and device of image cropping model and image cropping method and device - Google Patents


Publication number: CN112132146A
Application number: CN202010817887.9A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: image, model, loss value, sub, feature
Inventors: 张健为, 赖申其, 柴振华
Assignee: Beijing Sankuai Online Technology Co Ltd
Legal status: Pending

Classifications

    • G06V 10/267: Image or video recognition or understanding; image preprocessing; segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G06F 18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N 3/045: Neural networks; combinations of networks

Abstract

The application discloses a training method and device of an image cropping model, and an image cropping method and device. The training method of the image cropping model comprises the following steps: acquiring a first image feature of a training image pair extracted by a first image cropping model, and acquiring a second image feature of the training image pair extracted by a second image cropping model; determining a scoring difference between the original image and the sub-image according to the first original image feature and the first sub-image feature, so as to determine a first ranking loss value; performing knowledge distillation on the first image cropping model according to the second image features extracted by the second image cropping model, and determining a distillation loss value; and updating parameters of the first image cropping model according to the first ranking loss value and the distillation loss value. With this scheme, implicit features of the image such as saliency features can be learned automatically and feature robustness is improved, while the degree to which the image's important information is retained is taken into account; the lightweight design of the model effectively reduces feed-forward time and enhances the model's practicability.

Description

Training method and device of image cropping model and image cropping method and device
Technical Field
The application relates to the technical field of image cropping, in particular to a training method and device of an image cropping model and an image cropping method and device.
Background
Intelligent image cropping crops out or displays a region containing the core features of an image at a specified size, without losing the main information of the original image, based on the principles of human visual perception. It balances storage-space requirements and information retention while allowing image summaries to be browsed. Most current intelligent image cropping schemes need to explicitly model common photographic rules (symmetry, rule of thirds, emphasis, etc.) and design a suitable cropping strategy based on these explicit rules. After the first aesthetic quality assessment dataset, AVA, was presented at the CVPR (IEEE Conference on Computer Vision and Pattern Recognition) 2012 conference, intelligent cropping schemes based on deep learning began to emerge. The paper "Learning to Compose with Professional Photographs on the Web", published at the ACM MM (ACM International Conference on Multimedia) 2017 conference, converts the image cropping task into a sub-image ranking task, designs a view finding network with a sliding-window search strategy, learns the photography rules implicitly encoded in professional photographs, and achieved the then state-of-the-art level on several image cropping datasets.
However, the inventors found that the image cropping models adopted by prior-art image cropping schemes still need further improvement in image cropping efficiency.
Disclosure of Invention
In view of the above, the present application is made to provide a training method and apparatus of an image cropping model, and an image cropping method and apparatus that overcome or at least partially solve the above problems.
According to a first aspect of the present application, there is provided a training method of an image cropping model, comprising:
acquiring a first image feature of a training image pair extracted by a first image cropping model, and acquiring a second image feature of the training image pair extracted by a second image cropping model, wherein the training image pair comprises an original image and a sub-image thereof, and the first image feature comprises a first original image feature and a first sub-image feature;
determining a scoring difference between the original image and the sub-image according to the first original image feature and the first sub-image feature, and determining a first ranking loss value according to the scoring difference;
performing knowledge distillation on the first image cropping model according to a second image feature extracted by the second image cropping model, and determining a distillation loss value;
updating parameters of the first image cropping model according to the first ranking loss value and the distillation loss value.
Optionally, the second image cropping model includes an image scoring sub-model, the second image feature includes a second original image feature and a second sub-image feature, and the distillation loss value includes an original image feature distillation loss value and a sub-image feature distillation loss value; the performing knowledge distillation on the first image cropping model according to the second image feature extracted by the second image cropping model and determining the distillation loss value includes:
determining the original image feature distillation loss value according to a feature difference between the first original image feature and the second original image feature;
determining the sub-image feature distillation loss value according to a feature difference between the first sub-image feature and the second sub-image feature.
Optionally, the image scoring sub-model is trained by:
extracting a second original image feature and a second sub-image feature of the training image pair by using the convolution layer of the image scoring sub-model;
processing the second original image feature and the second sub-image feature by using the fully connected layer of the image scoring sub-model, and determining a scoring difference between the original image and the sub-image according to the processed second original image feature and the processed second sub-image feature;
and determining a second ranking loss value according to the scoring difference, and updating the parameters of the image scoring sub-model according to the second ranking loss value.
Optionally, the second image cropping model comprises an image saliency detection sub-model, the second image features comprise saliency features, the distillation loss values comprise saliency distillation loss values, and the obtaining second image features of the training image pair extracted by the second image cropping model comprises:
extracting the saliency features of the original image by using the image saliency detection sub-model;
and the performing knowledge distillation on the first image cropping model according to the second image features extracted by the second image cropping model and determining a distillation loss value comprises:
up-sampling the first image features extracted by the first image cropping model;
determining the saliency distillation loss value according to a feature difference between the upsampled first image features and the saliency features.
Optionally, the saliency features include an original image saliency feature and a sub-image saliency feature, the saliency distillation loss value includes an original image saliency distillation loss value and a sub-image saliency distillation loss value, and the determining the saliency distillation loss value according to the feature difference between the upsampled first image feature and the saliency feature includes:
determining the original image saliency distillation loss value according to the feature difference between the upsampled first original image feature and the original image saliency feature;
determining the sub-image saliency distillation loss value according to the feature difference between the upsampled first sub-image feature and the sub-image saliency feature.
Optionally, the distillation loss value comprises a feature distillation loss value, and the updating the parameters of the first image cropping model according to the first ranking loss value and the distillation loss value comprises:
performing global pooling on the image saliency features to obtain global pooled features;
determining a weight of the feature distillation loss value and a weight of the saliency distillation loss value according to the global pooled features;
and updating the parameters of the first image cropping model according to the first ranking loss value, the feature distillation loss value, the saliency distillation loss value and the corresponding weights.
Optionally, the training image pair is obtained by:
crawling an original image from the Internet;
randomly cropping the original image to obtain a sub-image corresponding to the original image;
and resizing the original image and the corresponding sub-image to the same size to obtain the training image pair.
According to a second aspect of the present application, there is provided an image cropping method comprising:
acquiring an image to be cropped;
scanning the image to be cropped by using a sliding window of a preset size to obtain a plurality of sub-images;
and scoring each sub-image by using a first image cropping model, and outputting a cropping result of the image to be cropped according to the scoring results, wherein the first image cropping model is trained based on the above training method of the image cropping model.
According to a third aspect of the present application, there is provided a training apparatus for an image cropping model, comprising:
a first acquisition unit, configured to acquire a first image feature of a training image pair extracted by a first image cropping model and acquire a second image feature of the training image pair extracted by a second image cropping model, wherein the training image pair comprises an original image and a sub-image thereof, and the first image feature comprises a first original image feature and a first sub-image feature;
a first determining unit, configured to determine a scoring difference between the original image and the sub-image according to the first original image feature and the first sub-image feature, and determine a first ranking loss value according to the scoring difference;
a second determining unit, configured to perform knowledge distillation on the first image cropping model according to a second image feature extracted by the second image cropping model, and determine a distillation loss value;
an updating unit, configured to update the parameters of the first image cropping model according to the first ranking loss value and the distillation loss value.
Optionally, the second image cropping model comprises an image scoring sub-model, the second image feature comprises a second original image feature and a second sub-image feature, the distillation loss value comprises an original image feature distillation loss value and a sub-image feature distillation loss value, and the second determining unit is further configured to:
determine the original image feature distillation loss value according to a feature difference between the first original image feature and the second original image feature;
determine the sub-image feature distillation loss value according to a feature difference between the first sub-image feature and the second sub-image feature.
Optionally, the image scoring sub-model is trained by:
extracting a second original image feature and a second sub-image feature of the training image pair by using the convolution layer of the image scoring sub-model;
processing the second original image feature and the second sub-image feature by using the fully connected layer of the image scoring sub-model, and determining a scoring difference between the original image and the sub-image according to the processed second original image feature and the processed second sub-image feature;
and determining a second ranking loss value according to the scoring difference, and updating the parameters of the image scoring sub-model according to the second ranking loss value.
Optionally, the second image cropping model comprises an image saliency detection sub-model, the second image features comprise saliency features, the distillation loss values comprise saliency distillation loss values, and the first obtaining unit is further configured to:
extract the saliency features of the original image by using the image saliency detection sub-model;
and the second determining unit is further configured to:
up-sample the first image features extracted by the first image cropping model;
determine the saliency distillation loss value according to a feature difference between the upsampled first image features and the saliency features.
Optionally, the saliency features include an original image saliency feature and a sub-image saliency feature, the saliency distillation loss value includes an original image saliency distillation loss value and a sub-image saliency distillation loss value, and the second determining unit is further configured to:
determine the original image saliency distillation loss value according to the feature difference between the upsampled first original image feature and the original image saliency feature;
determine the sub-image saliency distillation loss value according to the feature difference between the upsampled first sub-image feature and the sub-image saliency feature.
Optionally, the distillation loss value comprises a feature distillation loss value, and the updating unit is further configured to:
perform global pooling on the image saliency features to obtain global pooled features;
determine a weight of the feature distillation loss value and a weight of the saliency distillation loss value according to the global pooled features;
and update the parameters of the first image cropping model according to the first ranking loss value, the feature distillation loss value, the saliency distillation loss value and the corresponding weights.
Optionally, the training image pair is obtained by:
crawling an original image from the Internet;
randomly cropping the original image to obtain a sub-image corresponding to the original image;
and resizing the original image and the corresponding sub-image to the same size to obtain the training image pair.
According to a fourth aspect of the present application, there is provided an image cropping apparatus comprising:
a second acquisition unit for acquiring an image to be cropped;
a scanning unit for scanning the image to be cropped by using a sliding window of a preset size to obtain a plurality of sub-images;
and a cropping unit for scoring each sub-image by using a first image cropping model and outputting a cropping result of the image to be cropped according to the scoring results, wherein the first image cropping model is trained based on the above training apparatus of the image cropping model.
According to a fifth aspect of the present application, there is provided an electronic device comprising: a processor; and a memory arranged to store computer executable instructions that, when executed, cause the processor to perform a method of training an image cropping model as described in any of the above, or an image cropping method as described above.
According to a sixth aspect of the present application, there is provided a computer readable storage medium, wherein the computer readable storage medium stores one or more programs which, when executed by a processor, implement the method of training an image cropping model as described in any of the above, or the method of image cropping as described in any of the above.
According to the technical scheme, the training image pair comprises an original image and a sub-image thereof; a first image feature of the training image pair extracted by the first image cropping model and a second image feature of the training image pair extracted by the second image cropping model are obtained, wherein the first image feature comprises a first original image feature and a first sub-image feature; a scoring difference between the original image and the sub-image is determined according to the first original image feature and the first sub-image feature, and a first ranking loss value is determined according to the scoring difference; knowledge distillation is performed on the first image cropping model according to the second image feature extracted by the second image cropping model, and a distillation loss value is determined; and the parameters of the first image cropping model are updated according to the first ranking loss value and the distillation loss value. The image cropping model obtained by this training can break through the limitations of manually designed features, let the network learn the implicit features of images autonomously, and improve feature robustness. In addition, while implicitly encoding image features, the network takes into account how much of the image's important information is retained, and the lightweight design of the model effectively reduces feed-forward time and enhances the model's practicability.
The foregoing description is only an overview of the technical solutions of the present application, and the present application can be implemented according to the content of the description in order to make the technical means of the present application more clearly understood, and the following detailed description of the present application is given in order to make the above and other objects, features, and advantages of the present application more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 shows a schematic flow diagram of a method of training an image cropping model according to one embodiment of the present application;
FIG. 2 shows a schematic flow diagram of knowledge distillation according to one embodiment of the present application;
FIG. 3 illustrates a schematic flow chart of the training of an image scoring sub-model according to one embodiment of the present application;
FIG. 4 shows a schematic flow diagram of knowledge distillation according to another embodiment of the present application;
FIG. 5 illustrates a schematic flow chart of training an image cropping model according to one embodiment of the present application;
FIG. 6 shows a schematic flow diagram of an image cropping method according to one embodiment of the present application;
FIG. 7 illustrates an image cropping effect according to one embodiment of the present application;
FIG. 8 illustrates an image cropping effect according to another embodiment of the present application;
FIG. 9 is a schematic diagram illustrating an exemplary configuration of a training apparatus for an image cropping model according to an embodiment of the present application;
FIG. 10 shows a schematic structural diagram of an image cropping apparatus according to one embodiment of the present application;
FIG. 11 shows a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIG. 12 shows a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The image cropping schemes in the prior art mainly include the following two types: one is a traditional image cropping scheme based on hand-designed features and the other is an image cropping scheme based on depth features.
Image cropping schemes based on hand-designed features are often constrained by photographic rules and psychology, and have limitations, mainly embodied in two aspects. First, manually designed features have limited dimensionality and offer few interpretable high-level semantic features (such as symmetry, sharpness, color, etc.). Second, because a certain ambiguity exists between photographic rules and psychological rules, implementing them is difficult, and the validity of manually designed features is hard to guarantee.
The image cropping scheme based on depth features mainly uses a convolutional neural network to autonomously learn the photography rules implicitly encoded in images, maximizing the difference between the predicted score of the original image and that of the cropped sub-image. At inference time, image cropping is realized by ranking sub-images and selecting the best one. This kind of scheme mainly suffers from three problems. First, the supervision information is weak, and what the network learns may contain noise unrelated to the aesthetics of the image. Second, influenced by the size and position of the sliding window, a local region of the original image may not contain the original image's main information yet still satisfy the network's criteria in composition and content, producing misjudged crops with high predicted scores. Third, to extract image features fully, a heavyweight network is usually adopted as the backbone, but in practical applications its feed-forward time is too long to meet the system's response-time requirements.
Based on this, an embodiment of the present application provides a training method of an image cropping model, as shown in fig. 1, the training method of the image cropping model includes steps S110 to S140 as follows:
step S110, a first image feature of a training image pair extracted by a first image cropping model is obtained, a second image feature of the training image pair extracted by a second image cropping model is obtained, the training image pair comprises an original image and a sub-image thereof, and the first image feature comprises a first original image feature and a first sub-image feature.
In order to solve the problems of the prior art, such as long model response time and frequent misjudgment, the image cropping model provided by the present application can be obtained based on the principle of knowledge distillation: knowledge transfer is realized by introducing soft targets related to a teacher network (complex, but with excellent inference performance) as part of the total loss function to guide the training of a student network (simplified, with low complexity). Accordingly, the first image cropping model in the embodiment of the present application can be regarded as the lightweight student network to be trained, and the second image cropping model is a pre-trained heavyweight teacher network.
In specific implementation, the first image cropping model in the embodiment of the present application may adopt a lightweight network structure VGG-Tiny, where VGG is a network structure proposed by the Visual Geometry Group of Oxford University whose core contribution was to show that increasing network depth affects the network's final performance to a certain extent. The training image pair of the embodiment of the present application may comprise two parts: an original image (e.g. professional photographs crawled from various large websites), and a sub-image obtained by randomly cropping the original image. The first image cropping model extracts features from the original image and the corresponding sub-image respectively, yielding the first original image feature and the first sub-image feature.
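As an illustration of this pair-construction step, the following minimal sketch (a hypothetical helper, not code from the patent; PyTorch and PIL assumed) crops a random sub-image from an original image and resizes both to a common network input size:

```python
import random
from PIL import Image
import torchvision.transforms.functional as TF

def make_training_pair(path: str, size: int = 224, min_ratio: float = 0.5):
    """Build an (original, sub-image) tensor pair of identical size from one photo."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    # Randomly crop a region whose sides keep at least min_ratio of the original.
    cw = int(w * random.uniform(min_ratio, 1.0))
    ch = int(h * random.uniform(min_ratio, 1.0))
    left, top = random.randint(0, w - cw), random.randint(0, h - ch)
    sub = img.crop((left, top, left + cw, top + ch))
    # Resize both images to the same network input size, then convert to tensors.
    as_tensor = lambda im: TF.to_tensor(TF.resize(im, [size, size]))
    return as_tensor(img), as_tensor(sub)
```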
Step S120, determining a scoring difference between the original image and the sub-image according to the first original image feature and the first sub-image feature, and determining a first ranking loss value according to the scoring difference.
In existing image cropping schemes, there are various indexes for evaluating model performance, among which the Ranking Loss is common: ranking loss refers to the average number of incorrectly ordered label pairs per sample, and the index reflects the likelihood that a sample scores lower on its relevant labels than on irrelevant ones.
In specific implementation, based on the first original image feature and the first sub-image feature, the first image cropping model may be used to obtain an aesthetic score of the original image and an aesthetic score of the sub-image, and the first ranking loss value may be further determined according to a score difference between the original image and the sub-image.
And step S130, performing knowledge distillation on the first image cropping model according to the second image characteristics extracted by the second image cropping model, and determining a distillation loss value.
In order to significantly reduce the forward inference time of the network and accelerate the response speed of the system in practical application, as described above, in the embodiment of the present application, the first image cropping model is trained in a knowledge distillation manner, and the second image cropping model is used to perform knowledge distillation on the pre-designed lightweight first image cropping model, so that the first image cropping model can learn the supervision information of the second image cropping model. Specifically, a difference between a second image feature extracted by the second image cropping model and a first image feature extracted by the first image cropping model may be compared, and a distillation loss value of the first image cropping model may be determined based on the difference to optimize a parameter of the first image cropping model based on the distillation loss value.
Step S140, updating the parameters of the first image cropping model according to the first ranking loss value and the distillation loss value.
Therefore, after the ranking loss value and the distillation loss value of the first image cropping model are obtained, the two can be fused to obtain a fused loss value, and the parameters of the first image cropping model are updated by continuously reducing the fused loss value until the model achieves the expected effect.
The image cropping model obtained through the above training can break through the limitations of manually designed features, letting the network learn the implicit features of images autonomously and improving feature robustness. In addition, while implicitly encoding image features, the network takes into account how much of the image's important information is retained, and the lightweight design of the model effectively reduces feed-forward time and enhances the model's practicability.
In one embodiment of the present application, the second image cropping model includes an image scoring sub-model, the second image feature includes a second original image feature and a second sub-image feature, and the distillation loss value includes an original image feature distillation loss value and a sub-image feature distillation loss value; performing knowledge distillation on the first image cropping model according to the second image feature extracted by the second image cropping model and determining the distillation loss value includes: determining the original image feature distillation loss value according to a feature difference between the first original image feature and the second original image feature; and determining the sub-image feature distillation loss value according to a feature difference between the first sub-image feature and the second sub-image feature.
In specific implementation, as shown in FIG. 2, the teacher network used for knowledge distillation in the embodiment of the present application may include an image scoring sub-model (T1), i.e. an image aesthetic scoring sub-model: a model that mines implicit image aesthetic features to score images aesthetically. Similarly to the first image cropping model, when the image scoring sub-model performs feature extraction on the original image and the sub-image, a second original image feature and a second sub-image feature are obtained; the fully connected layers (FC) of the model then process the original image feature and the sub-image feature respectively to obtain the corresponding image scores, from which the ranking loss is calculated. Meanwhile, an original image feature distillation loss value can be obtained by comparing the feature difference between the first original image feature extracted by the first image cropping model (S) and the second original image feature extracted by the image scoring sub-model (T1), and a sub-image feature distillation loss value can be obtained by comparing the feature difference between the first sub-image feature and the second sub-image feature.
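A minimal sketch of this step, assuming MSE as the feature-difference measure (the patent does not fix a particular distance) and matching student/teacher feature shapes:

```python
import torch
import torch.nn.functional as F

def feature_distill_loss(student_feat: torch.Tensor, teacher_feat: torch.Tensor) -> torch.Tensor:
    """Penalize the difference between student (S) and teacher (T1) features;
    the teacher is frozen, so its features are detached from the graph."""
    return F.mse_loss(student_feat, teacher_feat.detach())

# Illustrative shapes: a batch of 4 embeddings from original images and sub-images.
s_orig, s_sub = torch.randn(4, 512), torch.randn(4, 512)   # first image cropping model (S)
t_orig, t_sub = torch.randn(4, 512), torch.randn(4, 512)   # image scoring sub-model (T1)
orig_distill_loss = feature_distill_loss(s_orig, t_orig)   # original image feature distillation loss
sub_distill_loss = feature_distill_loss(s_sub, t_sub)      # sub-image feature distillation loss
```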
In one embodiment of the present application, the image scoring sub-model is trained by: extracting a second original image feature and a second sub-image feature of the training image pair by using the convolution layer of the image scoring sub-model; processing the second original image feature and the second sub-image feature by using a full connection layer of the image scoring sub-model, and determining scoring difference between the original image and the sub-image according to the processed second original image feature and the processed second sub-image feature; and determining a second sequencing loss value according to the grading difference, and updating the parameters of the image grading sub-model according to the second sequencing loss value.
In specific implementation, the image scoring sub-model of the teacher network in the embodiment of the present application may adopt the VGG-19 network structure proposed by the computer vision group of Oxford University; of course, a person skilled in the art may also adopt other types of networks, such as AlexNet (the network designed by Alex Krizhevsky in Hinton's group that won the 2012 ImageNet competition) or ResNet-50 (a residual network), which are not listed one by one here.
The image scoring sub-model in the embodiment of the present application can be trained as follows. As shown in FIG. 3, an original image and a sub-image obtained by randomly cropping it are acquired and resized to the same size. The convolutional layers of the image scoring sub-model then extract features from the original image and the sub-image respectively, mapping them into a hidden feature space to obtain the corresponding second original image feature and second sub-image feature. The fully connected layer FC of the image scoring sub-model maps the learned distributed feature representations into the sample label space, yielding an aesthetic score score_1 for the original image and an aesthetic score score_2 for the sub-image. A second ranking loss value is computed from the scoring difference between them, and finally the parameters of the image scoring sub-model are updated according to the second ranking loss value until the model achieves the expected effect.
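A sketch of one such training step is given below; the VGG-19 backbone comes from torchvision, while the pooling, the single-output FC head, and the hinge margin are illustrative assumptions:

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

class ImageScorer(nn.Module):
    """Teacher scoring sub-model: convolutional features -> FC -> aesthetic score."""
    def __init__(self):
        super().__init__()
        self.features = vgg19(weights=None).features   # convolutional layers
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(512, 1)                    # maps hidden features to a score

    def forward(self, x):
        feat = self.pool(self.features(x)).flatten(1)  # hidden-layer feature space
        return self.fc(feat).squeeze(1), feat          # (aesthetic score, feature)

scorer = ImageScorer()
opt = torch.optim.Adam(scorer.parameters(), lr=1e-4)
orig = torch.randn(4, 3, 224, 224)   # batch of original images
sub = torch.randn(4, 3, 224, 224)    # their randomly cropped, resized sub-images
score_1, _ = scorer(orig)
score_2, _ = scorer(sub)
# Second ranking loss (pairwise hinge, cf. formula (1) below).
loss = torch.clamp(1.0 + score_2 - score_1, min=0).mean()
opt.zero_grad(); loss.backward(); opt.step()
```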
The first ranking loss value and the second ranking loss value (abbreviated L_ranking) of the embodiments of the present application can be calculated with a pairwise hinge formulation of the following form (with the margin taken as 1):

L_ranking = (1/n) · Σᵢ max(0, 1 + score_2 − score_1)   (1)

where n represents the number of training image pairs, score_1 represents the aesthetic score of the original image, and score_2 represents the aesthetic score of the sub-image.
The above formula (1) rests on a simple assumption: pictures taken by professional photographers generally have good composition, and a sub-image randomly cropped out of such a picture will, with high probability, not. That is, the original image's composition score should be higher than the sub-image's.
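In code, formula (1) under this reading is a pairwise hinge loss; the sketch below assumes a margin of 1, as in the reconstruction above:

```python
import torch

def ranking_loss(score_1: torch.Tensor, score_2: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    """Formula (1): the original image's score (score_1) should exceed the
    sub-image's score (score_2) by at least `margin`; violations are averaged."""
    return torch.clamp(margin + score_2 - score_1, min=0).mean()

# Scores for n = 3 training pairs:
s1 = torch.tensor([2.0, 1.5, 0.9])   # aesthetic scores of the original images
s2 = torch.tensor([0.5, 1.8, 0.2])   # aesthetic scores of the sub-images
print(ranking_loss(s1, s2))          # pairs violating the margin contribute, others give 0
```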
In one embodiment of the present application, the second image cropping model comprises an image saliency detection sub-model, the second image features comprise saliency features, and the distillation loss value comprises a saliency distillation loss value; the obtaining of the second image features of the training image pair extracted by the second image cropping model comprises: extracting the saliency features of the original image by using the image saliency detection sub-model; and the performing knowledge distillation on the first image cropping model according to the second image features extracted by the second image cropping model and determining a distillation loss value comprises: upsampling the first image features extracted by the first image cropping model; determining the saliency distillation loss value according to a feature difference between the upsampled first image features and the saliency features.
In specific implementation, as shown in FIG. 4, when the first image cropping model is trained to predict the aesthetic score of an image, besides content factors such as composition, the degree to which the sub-image retains the important information of the original image can also be considered, so as to avoid the model misjudging sub-images of low relevance. For this purpose, the second image cropping model of the embodiment of the present application may further include an image saliency detection sub-model (T2), and the saliency features of the original image are extracted using this sub-model.
In order to process the image at the pixel level, after the convolutional layers extract abstract features of the image, the first image features may be restored to the original size by upsampling (up-sample). Common upsampling methods include bilinear interpolation, transposed convolution, and so on; a person skilled in the art can flexibly choose which upsampling method to use according to actual needs, which is not specifically limited here. The feature difference between the upsampled first image features and the saliency features is then compared to determine the saliency distillation loss value.
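A sketch of this saliency distillation step, with bilinear interpolation as the upsampling method; collapsing the student's feature channels into a single attention map is an assumption the patent leaves open:

```python
import torch
import torch.nn.functional as F

def saliency_distill_loss(student_feat: torch.Tensor, saliency_map: torch.Tensor) -> torch.Tensor:
    """Upsample student conv features to the saliency map's resolution and
    penalize the pixel-level difference against the frozen T2 output."""
    attn = student_feat.mean(dim=1, keepdim=True)                 # (B, 1, h, w)
    up = F.interpolate(attn, size=saliency_map.shape[-2:],
                       mode="bilinear", align_corners=False)      # restore original size
    return F.mse_loss(up, saliency_map.detach())

feat = torch.randn(2, 512, 14, 14)   # first image cropping model features
sal = torch.rand(2, 1, 224, 224)     # saliency maps from the detection sub-model (T2)
print(saliency_distill_loss(feat, sal))
```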
In one embodiment of the present application, the saliency features include an original image saliency feature and a sub-image saliency feature, the saliency distillation loss value includes an original image saliency distillation loss value and a sub-image saliency distillation loss value, and the determining the saliency distillation loss value according to the feature difference between the upsampled first image feature and the saliency feature comprises: determining the original image saliency distillation loss value according to the feature difference between the upsampled first original image feature and the original image saliency feature; determining the sub-image saliency distillation loss value according to the feature difference between the upsampled first sub-image feature and the sub-image saliency feature.
The saliency features of the embodiment of the present application likewise include original image saliency features and sub-image saliency features. As described above, the original image saliency features may be obtained by extracting features of the original image with the image saliency detection sub-model (T2), and the sub-image saliency features may be obtained by randomly cropping (crop) the extracted saliency features of the original image. After the original image saliency features and the sub-image saliency features are obtained, an original image saliency distillation loss value can be obtained by comparing the feature difference between the original image saliency features and the upsampled first original image features, and a sub-image saliency distillation loss value can be obtained by comparing the feature difference between the sub-image saliency features and the upsampled first sub-image features.
In one embodiment of the present application, the distillation loss value comprises a feature distillation loss value, and the updating the parameters of the first image cropping model according to the first ranking loss value and the distillation loss value comprises: performing global pooling on the image saliency features to obtain global pooled features; determining a weight of the feature distillation loss value and a weight of the saliency distillation loss value according to the global pooled features; and updating the parameters of the first image cropping model according to the first ranking loss value, the feature distillation loss value, the saliency distillation loss value and the corresponding weights.
In specific implementation, the dual-teacher (T1 + T2) distillation method provided in the embodiment of the present application may still produce some error cases (wrongly cropped images). Analyzing these error cases shows that they mainly fall into the following two types: 1) the original image has obvious saliency features and is cropped wrongly; 2) the saliency features of the original image are not obvious, in which case the saliency constraint prevents the network from normally extracting the aesthetic features of the image, and the image is cropped wrongly. The loss weights of feature distillation and saliency distillation therefore have a large influence on the model, and if the same weight is used for both in the distillation stage, the model's cropping quality is easily degraded. To solve this problem, the embodiment of the present application provides an adaptive weighting scheme; the final loss function L may adopt the following formula (2):

L = α·L_embeds_distill + (1 − α)·L_saliency_distill + L_ranking   (2)

where the weight α is obtained by global average pooling of the saliency features and represents the spatial mean of the input image's saliency: the larger α is, the more salient the image. The feature distillation loss L_embeds_distill and the saliency distillation loss L_saliency_distill each comprise a term for the original image and a term for the sub-image cropped from it.
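A sketch of this adaptive weighting; α is the global average pooling (spatial mean) of the input image's saliency map, and the clamp to [0, 1] is an added safeguard not stated in the text:

```python
import torch

def adaptive_total_loss(l_embed: torch.Tensor, l_saliency: torch.Tensor,
                        l_ranking: torch.Tensor, saliency_map: torch.Tensor) -> torch.Tensor:
    """Formula (2): L = a*L_embeds_distill + (1-a)*L_saliency_distill + L_ranking,
    where a is the spatial mean of the original image's saliency."""
    alpha = saliency_map.mean().clamp(0.0, 1.0)   # global average pooling over space
    return alpha * l_embed + (1.0 - alpha) * l_saliency + l_ranking
```

With this choice the two distillation terms trade off automatically according to how salient the input image is, rather than using a fixed weight in the distillation stage.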
In one embodiment of the present application, the training image pair is obtained by: crawling an original image from the Internet; randomly cropping the original image to obtain a sub-image corresponding to the original image; and resizing the original image and the corresponding sub-image to the same size to obtain the training image pair.
Different from methods that compute an aesthetic score by direct regression, this way of constructing training pairs requires no manual annotation, which reduces the influence of an annotator's subjective factors.
As shown in FIG. 5, a schematic diagram of the training process of the image cropping model is provided. An original image is first crawled from the Internet and randomly cropped (random crop) to obtain a corresponding sub-image, and the original image and the sub-image are resized (resize) to the same size. The first image cropping model then extracts features from the original image and the sub-image respectively to obtain original image features and sub-image features; the fully connected layer FC of the first image cropping model processes them respectively to obtain an aesthetic score (score_1) for the original image and an aesthetic score (score_2) for the sub-image, from which the Ranking Loss of the first image cropping model is obtained. Next, the image features extracted by the first image cropping model are compared with those extracted by the pre-trained image scoring sub-model, yielding the feature distillation losses (Embeds distill loss) corresponding to the original image and the sub-image respectively. Meanwhile, the image saliency detection sub-model performs saliency detection on the original image, yielding the original image saliency features and the corresponding sub-image saliency features. The original image features and sub-image features extracted by the first image cropping model are upsampled and then compared with the original image saliency features and sub-image saliency features obtained from saliency detection, yielding the saliency distillation losses (Saliency distill loss) corresponding to the original image and the sub-image respectively. Finally, the parameters of the first image cropping model are updated based on its ranking loss, feature distillation loss, and saliency distillation loss until the model achieves the expected effect, completing the training.
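Pulling the pieces of FIG. 5 together, one training iteration might be sketched as follows; the tensors stand in for the outputs of the student (S), the scoring teacher (T1), and the saliency teacher (T2), with illustrative shapes throughout:

```python
import torch
import torch.nn.functional as F

# Stand-ins for one batch: student outputs require grad, teacher outputs do not.
s_feat_o = torch.randn(4, 512, 14, 14, requires_grad=True)  # S features, original images
s_feat_s = torch.randn(4, 512, 14, 14, requires_grad=True)  # S features, sub-images
t_feat_o, t_feat_s = torch.randn(4, 512, 14, 14), torch.randn(4, 512, 14, 14)  # T1
sal_o, sal_s = torch.rand(4, 1, 224, 224), torch.rand(4, 1, 224, 224)          # T2
score_1 = torch.randn(4, requires_grad=True)  # S scores for original images
score_2 = torch.randn(4, requires_grad=True)  # S scores for sub-images

def up(f):  # upsample a feature map to the saliency resolution
    return F.interpolate(f.mean(1, keepdim=True), size=(224, 224),
                         mode="bilinear", align_corners=False)

l_ranking = torch.clamp(1.0 + score_2 - score_1, min=0).mean()              # Ranking Loss
l_embed = F.mse_loss(s_feat_o, t_feat_o) + F.mse_loss(s_feat_s, t_feat_s)   # Embeds distill loss
l_sal = F.mse_loss(up(s_feat_o), sal_o) + F.mse_loss(up(s_feat_s), sal_s)   # Saliency distill loss
alpha = sal_o.mean()                     # adaptive weight from the original image's saliency
loss = alpha * l_embed + (1 - alpha) * l_sal + l_ranking
loss.backward()                          # gradients flow back into the student's outputs
```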
The embodiment of the present application provides an image cropping method, as shown in fig. 6, including steps S610 to S630 as follows:
step S610, acquiring an image to be cut.
And S620, scanning the image to be cut by using a sliding window with a preset size to obtain a plurality of sub-images.
Step S630, a first image cropping model is used for scoring each sub-image, and a cropping result of the image to be cropped is output according to the scoring result, wherein the first image cropping model is obtained by training based on the training method of the image cropping model.
In specific implementation, an image to be cropped is acquired and scanned with a sliding window of a given size, producing a series of sub-images of the image to be cropped. Each sub-image is input into the first image cropping model for scoring; the sub-image with the highest score is taken as the final candidate, and the image to be cropped is cropped based on this candidate to obtain the cropping result. With this image cropping method, a cropped image retaining more of the original image's key information can be obtained, the model responds quickly, and image cropping efficiency is improved.
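An illustrative sketch of this inference procedure; the stride, the window size, and the `model` interface (a callable returning one score per crop) are assumptions:

```python
import torch

def best_crop(model, image: torch.Tensor, win=(224, 224), stride: int = 32):
    """Scan `image` (3, H, W, at least window-sized) with a sliding window,
    score every sub-image with the trained first image cropping model,
    and return the highest-scoring box."""
    _, H, W = image.shape
    boxes, crops = [], []
    for top in range(0, H - win[0] + 1, stride):
        for left in range(0, W - win[1] + 1, stride):
            boxes.append((top, left, win[0], win[1]))
            crops.append(image[:, top:top + win[0], left:left + win[1]])
    with torch.no_grad():
        scores = model(torch.stack(crops))   # one aesthetic score per sub-image
    return boxes[int(scores.argmax())]       # (top, left, height, width) of best crop
```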
The first image cropping model in the embodiment of the present application is trained based on the following method: acquiring a first image feature of a training image pair extracted by the first image cropping model, and acquiring a second image feature of the training image pair extracted by a second image cropping model, wherein the training image pair comprises an original image and a sub-image thereof, and the first image feature comprises a first original image feature and a first sub-image feature; determining a scoring difference between the original image and the sub-image according to the first original image feature and the first sub-image feature, and determining a first ranking loss value according to the scoring difference; performing knowledge distillation on the first image cropping model according to the second image feature extracted by the second image cropping model, and determining a distillation loss value; and updating parameters of the first image cropping model according to the first ranking loss value and the distillation loss value.
As shown in FIG. 7, a schematic diagram of image cropping effects is provided. FIG. 7 shows a given original image, the cropped image obtained with feature distillation only, and the cropped image obtained by combining feature distillation with saliency distillation; it can be seen that combining saliency information enhances the cropped image's retention of the original image's key information.
To make the image cropping method provided by the present application more persuasive, the Ranking Loss on the test set is used as the quantitative index; the smaller the ranking loss, the better the model performance. The teacher network adopts the heavyweight VGG-19 network, and the student network adopts the pre-designed lightweight VGG network VGG-Tiny. The schemes provided by the present application are compared in Table 1. It can be seen that, compared with the conventional feature distillation scheme, the model obtained after introducing saliency information performs better, and adaptively adjusting the distillation loss weights of the two teacher networks further improves model performance.
TABLE 1

Network | Method | Ranking loss
VGG-19 (teacher) | - | 0.0016
VGG-Tiny | Feature distillation | 0.0082
VGG-Tiny | Feature distillation + saliency distillation | 0.0060
VGG-Tiny | Feature distillation + saliency distillation + adaptive weighting | 0.0047
As shown in FIG. 8, an image cropping effect schematic is provided. It can be seen that the adaptive-weight image cropping method provided by the embodiment of the present application effectively alleviates the model's misjudgment of low-saliency images (as shown in the last two rows of images in FIG. 8), and also improves performance on previously misjudged images to a certain extent, effectively improving the robustness of the network.
The embodiment of the present application provides an image cropping model training apparatus 900. As shown in FIG. 9, the image cropping model training apparatus 900 includes: a first obtaining unit 910, a first determining unit 920, a second determining unit 930, and an updating unit 940.
The first obtaining unit 910 of the embodiment of the present application is configured to obtain a first image feature of a training image pair extracted by a first image cropping model, and obtain a second image feature of the training image pair extracted by a second image cropping model, wherein the training image pair comprises an original image and a sub-image thereof, and the first image feature comprises a first original image feature and a first sub-image feature.
In order to solve the problems of the prior art, such as long model response time and frequent misjudgment, the image cropping model provided by the present application can be obtained based on the principle of knowledge distillation: knowledge transfer is realized by introducing soft targets related to a teacher network (complex, but with excellent inference performance) as part of the total loss function to guide the training of a student network (simplified, with low complexity). Accordingly, the first image cropping model in the embodiment of the present application can be regarded as the lightweight student network to be trained, and the second image cropping model is a pre-trained heavyweight teacher network.
In specific implementation, the first image cropping model in the embodiment of the present application may adopt a lightweight network structure VGG-Tiny, where VGG is a network structure proposed by the Visual Geometry Group of Oxford University whose core contribution was to show that increasing network depth affects the network's final performance to a certain extent. The training image pair of the embodiment of the present application may comprise two parts: an original image (e.g. professional photographs crawled from various large websites), and a sub-image obtained by randomly cropping the original image. The first image cropping model extracts features from the original image and the corresponding sub-image respectively, yielding the first original image feature and the first sub-image feature.
The first determining unit 920 of the embodiment of the present application is configured to determine a scoring difference between the original image and the sub-image according to the first original image feature and the first sub-image feature, and determine a first ranking loss value according to the scoring difference.
In existing image cropping schemes, there are various indexes for evaluating model performance, among which the Ranking Loss is common: ranking loss refers to the average number of incorrectly ordered label pairs per sample, and the index reflects the likelihood that a sample scores lower on its relevant labels than on irrelevant ones.
In specific implementation, based on the first original image feature and the first sub-image feature, the first image cropping model may be used to obtain an aesthetic score of the original image and an aesthetic score of the sub-image, and the first ranking loss value may be further determined according to a score difference between the original image and the sub-image.
A second determining unit 930 of the embodiment of the present application, configured to perform knowledge distillation on the first image cropping model according to the second image feature extracted by the second image cropping model, and determine a distillation loss value.
In order to significantly reduce the forward inference time of the network and accelerate the response speed of the system in practical application, as described above, in the embodiment of the present application, the first image cropping model is trained in a knowledge distillation manner, and the second image cropping model is used to perform knowledge distillation on the pre-designed lightweight first image cropping model, so that the first image cropping model can learn the supervision information of the second image cropping model. Specifically, a difference between a second image feature extracted by the second image cropping model and a first image feature extracted by the first image cropping model may be compared, and a distillation loss value of the first image cropping model may be determined based on the difference to optimize a parameter of the first image cropping model based on the distillation loss value.
An updating unit 940 of the embodiment of the present application is configured to update the parameters of the first image cropping model according to the first ranking loss value and the distillation loss value.
Therefore, after the ranking loss value and the distillation loss value of the first image cropping model are obtained, the two can be fused to obtain a fused loss value, and the parameters of the first image cropping model are updated by continuously reducing the fused loss value until the model achieves the expected effect.
The image cropping model obtained through the above training can break through the limitations of manually designed features, letting the network learn the implicit features of images autonomously and improving feature robustness. In addition, while implicitly encoding image features, the network takes into account how much of the image's important information is retained, and the lightweight design of the model effectively reduces feed-forward time and enhances the model's practicability.
In an embodiment of the application, the second image cropping model comprises an image scoring sub-model, the second image feature comprises a second original image feature and a second sub-image feature, the distillation loss value comprises an original image feature distillation loss value and a sub-image feature distillation loss value, and the second determining unit 930 is further configured to: determine the original image feature distillation loss value according to the feature difference between the first original image feature and the second original image feature; and determine the sub-image feature distillation loss value according to the feature difference between the first sub-image feature and the second sub-image feature.
In one embodiment of the present application, the image scoring sub-model is trained by: extracting a second original image feature and a second sub-image feature of the training image pair with the convolutional layers of the image scoring sub-model; processing the second original image feature and the second sub-image feature with the fully connected layer of the image scoring sub-model, and determining the score difference between the original image and the sub-image from the processed features; and determining a second ranking loss value from the score difference and updating the parameters of the image scoring sub-model according to the second ranking loss value.
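A teacher scoring sub-model of this shape, a convolutional feature extractor followed by a fully connected scoring head, might look as follows; the specific layer sizes are assumptions.

import torch
import torch.nn as nn

class ImageScoringModel(nn.Module):
    """Teacher image scoring sub-model (illustrative sketch): conv layers
    extract the second image features, and a fully connected layer maps a
    pooled feature to an aesthetic score. Layer sizes are assumptions."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.fc = nn.Linear(64, 1)

    def forward(self, x: torch.Tensor):
        feat = self.conv(x)               # second original/sub-image feature
        pooled = feat.mean(dim=(2, 3))    # global average pooling
        score = self.fc(pooled)           # aesthetic score
        return feat, score

Trained this way, the second ranking loss would be computed from the score difference between the original image and the sub-image, in the same manner as for the student model.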
In an embodiment of the application, the second image cropping model comprises an image saliency detection sub-model, the second image feature comprises a saliency feature, and the distillation loss value comprises a saliency distillation loss value. The first acquiring unit 910 is further configured to extract the saliency feature of the original image with the image saliency detection sub-model. The second determining unit is further configured to: upsample the first image feature extracted by the first image cropping model; and determine the saliency distillation loss value according to the feature difference between the upsampled first image feature and the saliency feature.
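The upsample-then-match step could be sketched as below; collapsing the student's channels to a single map by averaging is an assumption, since the embodiment specifies only upsampling followed by a feature difference.

import torch
import torch.nn.functional as F

def saliency_distillation_loss(student_feat: torch.Tensor,
                               saliency_map: torch.Tensor) -> torch.Tensor:
    """Upsamples the student's low-resolution feature map to the spatial
    size of the teacher's saliency map, then matches them with an L2
    loss. The channel-mean reduction is an illustrative assumption."""
    up = F.interpolate(student_feat, size=saliency_map.shape[-2:],
                       mode="bilinear", align_corners=False)
    up = up.mean(dim=1, keepdim=True)  # assumed reduction to one channel
    return F.mse_loss(up, saliency_map.detach())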
In an embodiment of the application, the saliency feature comprises an original image saliency feature and a sub-image saliency feature, and the saliency distillation loss value comprises an original image saliency distillation loss value and a sub-image saliency distillation loss value. The second determining unit 930 is further configured to: determine the original image saliency distillation loss value according to the feature difference between the upsampled first original image feature and the original image saliency feature; and determine the sub-image saliency distillation loss value according to the feature difference between the upsampled first sub-image feature and the sub-image saliency feature.
In an embodiment of the present application, the distillation loss value comprises a feature distillation loss value, and the updating unit 940 is further configured to: perform global pooling on the image saliency feature to obtain a global pooling feature; determine a weight for the feature distillation loss value and a weight for the saliency distillation loss value according to the global pooling feature; and update the parameters of the first image cropping model according to the first ranking loss value, the feature distillation loss value, the saliency distillation loss value, and the corresponding weights.
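One way to derive such weights from a globally pooled saliency feature is sketched below; mapping the pooled value through a sigmoid into a pair of weights summing to one is an assumption, as the embodiment states only that the weights are determined from the global pooling feature.

import torch
import torch.nn.functional as F

def adaptive_loss_weights(saliency_feat: torch.Tensor):
    """Derives weights for the feature and saliency distillation losses
    from a global pooling of the saliency feature. The sigmoid mapping
    and the weights summing to one are illustrative assumptions."""
    pooled = F.adaptive_avg_pool2d(saliency_feat, 1).mean()
    w_saliency = torch.sigmoid(pooled)  # more salient content -> larger weight
    w_feature = 1.0 - w_saliency
    return w_feature, w_saliency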
In one embodiment of the present application, the training image pair is obtained by: crawling an original image from the Internet; randomly cropping the original image to obtain a sub-image corresponding to the original image; and rescaling the original image and the corresponding sub-image to the same size to obtain the training image pair.
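A minimal data-preparation sketch along these lines, assuming PIL and an assumed crop-ratio range and target size:

import random
from PIL import Image

def make_training_pair(path: str, size=(224, 224), min_ratio=0.5):
    """Builds an (original, sub-image) training pair: randomly crop the
    original, then rescale both to a common size. The crop-ratio range
    and target size are illustrative assumptions."""
    orig = Image.open(path).convert("RGB")
    w, h = orig.size
    cw = int(w * random.uniform(min_ratio, 1.0))
    ch = int(h * random.uniform(min_ratio, 1.0))
    left = random.randint(0, w - cw)
    top = random.randint(0, h - ch)
    sub = orig.crop((left, top, left + cw, top + ch))
    return orig.resize(size), sub.resize(size)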
An embodiment of the present application provides an image cropping apparatus 1000. As shown in fig. 10, the image cropping apparatus 1000 comprises a second acquiring unit 1010, a scanning unit 1020, and a cropping unit 1030.
The second acquiring unit 1010 of the embodiment of the present application is configured to acquire an image to be cropped.
The scanning unit 1020 of the embodiment of the application is configured to scan the image to be cropped with a sliding window of a preset size to obtain a plurality of sub-images.
The cropping unit 1030 of the embodiment of the present application is configured to score each sub-image with the first image cropping model and output the cropping result of the image to be cropped according to the scoring results, where the first image cropping model is trained by the training device of the image cropping model described above.
In a specific implementation, the image to be cropped is acquired and then scanned with a sliding window of a given size to obtain a series of corresponding sub-images. Each sub-image is input into the first image cropping model for scoring; the sub-image with the highest score is taken as the final candidate, and the image to be cropped is cropped based on that candidate to obtain the cropping result. This image cropping method yields a cropped image that retains more of the original image's key information, while the model's short response time improves cropping efficiency.
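The sliding-window inference loop could be sketched as follows, assuming the trained model maps an image batch to scalar scores; the window size and stride are assumptions, as the embodiment says only "a sliding window of a preset size".

import torch

@torch.no_grad()
def crop_image(model, image: torch.Tensor, win=(224, 224), stride=32):
    """Scans the image (C, H, W) with a sliding window, scores every
    window with the trained cropping model, and returns the best crop
    and its box. Window size and stride are illustrative assumptions;
    the image is assumed to be at least as large as the window."""
    _, H, W = image.shape
    wh, ww = win
    best_score, best_box = float("-inf"), None
    for top in range(0, H - wh + 1, stride):
        for left in range(0, W - ww + 1, stride):
            window = image[:, top:top + wh, left:left + ww]
            score = model(window.unsqueeze(0)).item()  # assumed scalar output
            if score > best_score:
                best_score, best_box = score, (top, left, wh, ww)
    top, left, wh, ww = best_box
    return image[:, top:top + wh, left:left + ww], best_box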
The first image cropping model in the embodiment of the application is trained by a device comprising: a first acquiring unit, configured to acquire the first image feature of a training image pair extracted by the first image cropping model and the second image feature of the training image pair extracted by the second image cropping model, the training image pair comprising an original image and its sub-image, and the first image feature comprising a first original image feature and a first sub-image feature; a first determining unit, configured to determine a score difference between the original image and the sub-image according to the first original image feature and the first sub-image feature, and to determine a first ranking loss value according to the score difference; a second determining unit, configured to perform knowledge distillation on the first image cropping model according to the second image feature extracted by the second image cropping model, and to determine a distillation loss value; and an updating unit, configured to update the parameters of the first image cropping model according to the first ranking loss value and the distillation loss value.
It should be noted that, for the specific implementation of each apparatus embodiment, reference may be made to the specific implementation of the corresponding method embodiment, which is not described herein again.
In summary, according to the technical scheme of the application: a first image feature of a training image pair extracted by a first image cropping model is acquired, and a second image feature of the training image pair extracted by a second image cropping model is acquired, where the training image pair comprises an original image and its sub-image, and the first image feature comprises a first original image feature and a first sub-image feature; a score difference between the original image and the sub-image is determined according to the first original image feature and the first sub-image feature, and a first ranking loss value is determined according to the score difference; knowledge distillation is performed on the first image cropping model according to the second image feature extracted by the second image cropping model, and a distillation loss value is determined; and the parameters of the first image cropping model are updated according to the first ranking loss value and the distillation loss value. The image cropping model thus trained breaks through the limitations of hand-designed features, allows the network to learn implicit image features autonomously, and improves feature robustness. In addition, while implicitly encoding image features, the network takes into account how much of the image's important information is retained, and the lightweight design of the model effectively reduces feedforward time and enhances the model's practicability.
It should be noted that:
the algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general-purpose devices may also be used with the teachings herein, and the structure required to construct such a device will be apparent from the description above. In addition, this application is not directed to any particular programming language; it will be appreciated that a variety of programming languages may be used to implement the teachings of the present application as described herein, and the descriptions of specific languages above are provided to disclose the best mode of the application.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the application may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the application, various features of the application are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, this method of disclosure is not to be interpreted as reflecting an intention that the claimed application requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this application.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that although some embodiments described herein include some but not other features included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the application and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the present application may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the image cropping means or training device of the image cropping model according to embodiments of the present application. The present application may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present application may be stored on a computer readable medium or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
For example, fig. 11 shows a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device 1100 comprises a processor 1110 and a memory 1120 arranged to store computer executable instructions (computer readable program code). The memory 1120 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read-only memory), an EPROM, a hard disk, or a ROM. The memory 1120 has a storage space 1130 storing computer readable program code 1131 for performing any of the method steps described above. For example, the storage space 1130 may include respective computer readable program codes 1131 for implementing the various steps of the above methods. The computer readable program code 1131 may be read from or written to one or more computer program products, which comprise a program code carrier such as a hard disk, a compact disc (CD), a memory card, or a floppy disk. Such a computer program product is typically a computer readable storage medium, such as that shown in fig. 12. FIG. 12 shows a schematic diagram of a computer readable storage medium according to an embodiment of the present application. The computer readable storage medium 1200 stores computer readable program code 1131 for performing the steps of the method according to the present application, readable by the processor 1110 of the electronic device 1100; when the computer readable program code 1131 is executed by the electronic device 1100, the electronic device 1100 performs the steps of the method described above. In particular, the computer readable program code 1131 stored on the computer readable storage medium may perform the method shown in any of the embodiments described above. The computer readable program code 1131 may be compressed in a suitable form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the application, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The application may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not indicate any ordering; these words may be interpreted as names.

Claims (12)

1. A training method of an image cropping model is characterized by comprising the following steps:
acquiring a first image feature of a training image pair extracted by a first image cropping model, and acquiring a second image feature of the training image pair extracted by a second image cropping model, wherein the training image pair comprises an original image and a sub-image thereof, and the first image feature comprises a first original image feature and a first sub-image feature;
determining a score difference between the original image and the sub-image according to the first original image feature and the first sub-image feature, and determining a first ranking loss value according to the score difference;
performing knowledge distillation on the first image cropping model according to a second image feature extracted by the second image cropping model, and determining a distillation loss value;
updating parameters of the first image cropping model according to the first ranking loss value and the distillation loss value.
2. The training method of an image cropping model according to claim 1, wherein the second image cropping model comprises an image scoring sub-model, the second image feature comprises a second original image feature and a second sub-image feature, the distillation loss value comprises an original image feature distillation loss value and a sub-image feature distillation loss value, and the performing knowledge distillation on the first image cropping model according to the second image feature extracted by the second image cropping model comprises:
determining the original image feature distillation loss value according to a feature difference between the first original image feature and the second original image feature; and
determining the sub-image feature distillation loss value according to a feature difference between the first sub-image feature and the second sub-image feature.
3. The training method of an image cropping model according to claim 2, wherein the image scoring sub-model is trained by:
extracting a second original image feature and a second sub-image feature of the training image pair with the convolutional layers of the image scoring sub-model;
processing the second original image feature and the second sub-image feature with the fully connected layer of the image scoring sub-model, and determining a score difference between the original image and the sub-image according to the processed second original image feature and the processed second sub-image feature; and
determining a second ranking loss value according to the score difference, and updating the parameters of the image scoring sub-model according to the second ranking loss value.
4. The training method of an image cropping model according to claim 1, wherein the second image cropping model comprises an image saliency detection sub-model, the second image feature comprises a saliency feature, the distillation loss value comprises a saliency distillation loss value, and the acquiring a second image feature of the training image pair extracted by the second image cropping model comprises:
extracting the saliency feature of the original image with the image saliency detection sub-model;
and the performing knowledge distillation on the first image cropping model according to the second image feature extracted by the second image cropping model and determining a distillation loss value comprises:
upsampling the first image feature extracted by the first image cropping model; and
determining the saliency distillation loss value according to a feature difference between the upsampled first image feature and the saliency feature.
5. The training method of an image cropping model according to claim 4, wherein the saliency feature comprises an original image saliency feature and a sub-image saliency feature, the saliency distillation loss value comprises an original image saliency distillation loss value and a sub-image saliency distillation loss value, and the determining the saliency distillation loss value according to the feature difference between the upsampled first image feature and the saliency feature comprises:
determining the original image saliency distillation loss value according to a feature difference between the upsampled first original image feature and the original image saliency feature; and
determining the sub-image saliency distillation loss value according to a feature difference between the upsampled first sub-image feature and the sub-image saliency feature.
6. The training method of an image cropping model according to claim 4, wherein the distillation loss value comprises a feature distillation loss value, and the updating parameters of the first image cropping model according to the first ranking loss value and the distillation loss value comprises:
performing global pooling on the image saliency feature to obtain a global pooling feature;
determining a weight for the feature distillation loss value and a weight for the saliency distillation loss value according to the global pooling feature; and
updating the parameters of the first image cropping model according to the first ranking loss value, the feature distillation loss value, the saliency distillation loss value, and the corresponding weights.
7. The training method of an image cropping model according to any one of claims 1 to 6, wherein the training image pair is obtained by:
crawling an original image from the Internet;
randomly cropping the original image to obtain a sub-image corresponding to the original image; and
rescaling the original image and the corresponding sub-image to the same size to obtain the training image pair.
8. An image cropping method, comprising:
acquiring an image to be cropped;
scanning the image to be cropped with a sliding window of a preset size to obtain a plurality of sub-images; and
scoring each sub-image with a first image cropping model, and outputting a cropping result of the image to be cropped according to the scoring results, wherein the first image cropping model is trained by the training method of an image cropping model according to any one of claims 1 to 7.
9. An image cropping model training device, comprising:
a first acquiring unit, configured to acquire a first image feature of a training image pair extracted by a first image cropping model and a second image feature of the training image pair extracted by a second image cropping model, wherein the training image pair comprises an original image and a sub-image thereof, and the first image feature comprises a first original image feature and a first sub-image feature;
a first determining unit, configured to determine a score difference between the original image and the sub-image according to the first original image feature and the first sub-image feature, and to determine a first ranking loss value according to the score difference;
a second determination unit, configured to perform knowledge distillation on the first image cropping model according to a second image feature extracted by the second image cropping model, and determine a distillation loss value;
an updating unit, configured to update parameters of the first image cropping model according to the first ranking loss value and the distillation loss value.
10. An image cropping apparatus, comprising:
a second acquiring unit, configured to acquire an image to be cropped;
a scanning unit, configured to scan the image to be cropped with a sliding window of a preset size to obtain a plurality of sub-images; and
a cropping unit, configured to score each sub-image with a first image cropping model trained by the training device of the image cropping model according to claim 9, and to output a cropping result of the image to be cropped according to the scoring results.
11. An electronic device, comprising: a processor; and a memory arranged to store computer executable instructions that, when executed, cause the processor to perform the training method of an image cropping model according to any one of claims 1 to 7, or the image cropping method according to claim 8.
12. A computer readable storage medium storing one or more programs which, when executed by a processor, implement the training method of an image cropping model according to any one of claims 1 to 7, or the image cropping method according to claim 8.
CN202010817887.9A 2020-08-14 2020-08-14 Training method and device of image cropping model and image cropping method and device Pending CN112132146A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010817887.9A CN112132146A (en) 2020-08-14 2020-08-14 Training method and device of image cropping model and image cropping method and device


Publications (1)

Publication Number Publication Date
CN112132146A true CN112132146A (en) 2020-12-25

Family

ID=73851656

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010817887.9A Pending CN112132146A (en) 2020-08-14 2020-08-14 Training method and device of image cropping model and image cropping method and device

Country Status (1)

Country Link
CN (1) CN112132146A (en)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106560840A (en) * 2015-09-30 2017-04-12 腾讯科技(深圳)有限公司 Recognition processing method and device of image information
US20180260665A1 (en) * 2017-03-07 2018-09-13 Board Of Trustees Of Michigan State University Deep learning system for recognizing pills in images
CN109754391A (en) * 2018-12-18 2019-05-14 北京爱奇艺科技有限公司 A kind of image quality evaluating method, device and electronic equipment
CN109978015A (en) * 2019-03-06 2019-07-05 重庆金山医疗器械有限公司 A kind of image processing method, device and endoscopic system
CN110443784A (en) * 2019-07-11 2019-11-12 中国科学院大学 A kind of effective conspicuousness prediction model method
CN110837846A (en) * 2019-10-12 2020-02-25 深圳力维智联技术有限公司 Image recognition model construction method, image recognition method and device

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112927172A (en) * 2021-05-10 2021-06-08 北京市商汤科技开发有限公司 Training method and device of image processing network, electronic equipment and storage medium
CN113592007A (en) * 2021-08-05 2021-11-02 哈尔滨理工大学 Knowledge distillation-based bad picture identification system and method, computer and storage medium
CN113743420A (en) * 2021-08-26 2021-12-03 北京邮电大学 Web AR image recognition method and system based on cloud edge-side cooperation
CN113743420B (en) * 2021-08-26 2023-12-05 北京邮电大学 Web AR image recognition method and system based on cloud edge end cooperation
WO2023064083A1 (en) * 2021-10-13 2023-04-20 Google Llc Machine learning ranking distillation


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination