WO2023093851A1

WO2023093851A1 - Image cropping method and apparatus, and electronic device

Info

Publication number: WO2023093851A1
Application number: PCT/CN2022/134366
Authority: WO
Inventors: 刘鑫
Original assignee: 维沃移动通信有限公司
Priority date: 2021-11-29
Filing date: 2022-11-25
Publication date: 2023-06-01
Also published as: CN114119373A

Abstract

Disclosed in the present application are an image cropping method and apparatus, and an electronic device. The image cropping method comprises: determining a plurality of cropping candidate regions corresponding to a target image; obtaining image features corresponding to the target image, the image features comprising a first image feature and a second image feature, the first image feature being associated with a first image region corresponding to the plurality of cropping candidate regions, and the second image feature being associated with a second image region, other than the first image region, in the target image; inputting the image features corresponding to the target image into an image evaluation network model to obtain feature scores respectively corresponding to the plurality of cropping candidate regions, the feature score being used for representing at least one of an aesthetic feature and a salient feature of the cropping candidate region; and determining at least one target cropping candidate region according to the feature scores respectively corresponding to the plurality of cropping candidate regions, and cropping the target image according to the target cropping candidate region.

Description

Image cropping method, device and electronic equipment

Cross References to Related Applications

This application claims the priority of a Chinese patent application with application number 202111435959.4 and titled "Image Cropping Method, Apparatus, and Electronic Equipment" filed with the China Patent Office on November 29, 2021, the entire contents of which are hereby incorporated by reference in this application .

technical field

The present application relates to the field of communication technologies, and in particular to an image cropping method, device and electronic equipment.

Background technique

With the rapid development of electronic devices, using electronic devices to capture images has become a common image acquisition method. Images taken by users are stored in the photo album of electronic devices. Since most of the images are not taken by professional photographers, and the images are displayed in different frames, the aesthetic quality of the images is uneven.

When using the images in the album in some specific scenarios, it is necessary to crop the image to show the image characteristics of the image. For example, for the scenarios of displaying desktop widgets, album thumbnails, memory albums or album collection covers, because these display outlets There are often different aspect ratios. If the image is simply cropped to generate a desktop pendant, album thumbnail or album cover, the effect is often poor and the image quality is not good.

overview

The purpose of the embodiments of the present application is to provide an image cropping method, device and electronic equipment, so as to solve the problem of poor image quality obtained during image cropping in the prior art.

In the first aspect, the embodiment of the present application provides an image cropping method, including:

Determining multiple cropping candidate frames corresponding to the target image;

Acquiring image features corresponding to the target image, where the image features include first image features and second image features, the first image features are associated with the first image areas corresponding to the plurality of cropping candidate areas, and the second A second image feature is associated with a second image area in the target image other than the first image area;

Input the image features corresponding to the target image into the image evaluation network model, and obtain the feature scores corresponding to the plurality of cropping candidate regions respectively, and the feature scores are used to characterize the aesthetic features and salient features of the cropping candidate regions at least one;

Determine at least one candidate target cropping area according to feature scores respectively corresponding to the multiple candidate cropping areas, and crop the target image according to the candidate target cropping areas.

In a second aspect, an embodiment of the present application provides an image cropping device, including:

A determining module, configured to determine a plurality of cropping candidate regions corresponding to the target image;

The first acquisition module is configured to acquire image features corresponding to the target image, the image features include first image features and second image features, and the first image features correspond to the first cropping candidate regions. Image area association, the second image feature is associated with a second image area in the target image other than the first image area;

The second acquisition module is configured to input the image features corresponding to the target image into the image evaluation network model, and acquire feature scores respectively corresponding to the plurality of cropping candidate regions, and the feature scores are used to characterize the aesthetics of the cropping candidate regions at least one of characteristics and distinctive features;

A processing module, configured to determine at least one candidate target cropping area according to the feature scores corresponding to the plurality of candidate cropping areas, and crop the target image according to the candidate target cropping area.

In the third aspect, the embodiment of the present application provides an electronic device, the electronic device includes a processor and a memory, the memory stores programs or instructions that can run on the processor, and the programs or instructions are processed by the The steps of the method described in the first aspect are realized when the controller is executed.

In a fourth aspect, an embodiment of the present application provides a readable storage medium, on which a program or an instruction is stored, and when the program or instruction is executed by a processor, the steps of the method described in the first aspect are implemented .

In the fifth aspect, the embodiment of the present application provides a chip, the chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is used to run programs or instructions, so as to implement the first aspect the method described.

In a sixth aspect, an embodiment of the present application provides a computer program product, the program product is stored in a storage medium, and the program product is executed by at least one processor to implement the method described in the first aspect.

In a seventh aspect, the embodiment of the present application provides an electronic device configured to execute the method described in the first aspect.

In the embodiment of the present application, by determining a plurality of cropping candidate regions corresponding to the target image, the image features corresponding to the target image including the first image feature associated with the cropping candidate region and the second image feature associated with the non-cropping candidate region are acquired , input the image features corresponding to the target image into the image evaluation network model, obtain the feature scores of at least one of the representative aesthetic features and salient features corresponding to the multiple cropping candidate regions, and according to the feature scores corresponding to the multiple cropping candidate regions , determining at least one target cropping candidate region and cropping the target image can efficiently mine information in the image based on aesthetic and/or salient features, so as to ensure the image cropping effect and obtain a cropped image with good image quality.

Description of drawings

FIG. 1 shows a schematic diagram of an image cropping method provided in an embodiment of the present application;

Figure 2 shows a schematic diagram of the network architecture of the aesthetic evaluation task model provided by the embodiment of the present application;

FIG. 3 shows a schematic diagram of the network architecture of the saliency task model provided by the embodiment of the present application;

Fig. 4 represents the schematic diagram of the evaluation network model based on the strategy-acquired image provided by the embodiment;

Fig. 5 represents the schematic diagram of the image evaluation network model obtained based on strategy two provided by the embodiment;

FIG. 6 shows a schematic diagram of an image cropping device provided in an embodiment of the present application;

Fig. 7 is one of the schematic block diagrams of the electronic device provided by the embodiment of the present application;

FIG. 8 is a second schematic block diagram of an electronic device provided by an embodiment of the present application.

A detailed description

The following will clearly describe the technical solutions in the embodiments of the present application with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are part of the embodiments of the present application, but not all of them. All other embodiments obtained by persons of ordinary skill in the art based on the embodiments in this application belong to the protection scope of this application.

The terms "first", "second" and the like in the specification and claims of the present application are used to distinguish similar objects, and are not used to describe a specific sequence or sequence. It should be understood that the terms so used are interchangeable under appropriate circumstances such that the embodiments of the application can be practiced in sequences other than those illustrated or described herein, and that references to "first," "second," etc. distinguish Objects are generally of one type, and the number of objects is not limited. For example, there may be one or more first objects. In addition, "and/or" in the specification and claims means at least one of the connected objects, and the character "/" generally means that the related objects are an "or" relationship.

The image cropping method provided by the embodiment of the present application will be described in detail below through specific embodiments and application scenarios with reference to the accompanying drawings.

The embodiment of the present application provides an image cropping method, as shown in FIG. 1, including:

Step 101. Determine multiple cropping candidate frames corresponding to the target image.

In the image cropping method provided in the embodiment of the present application, firstly, a plurality of cropping candidate regions corresponding to the target image may be determined, so as to determine a final cropping region among the multiple cropping candidate regions. When determining multiple cropping candidate areas, it can be determined according to the preset composition principles. The preset composition principles can be determined based on various photographic composition principles. The various photographic composition principles can include but are not limited to triangular composition principles, diagonal composition principles, The principle of three-point composition, the principle of blank composition at the top of the sky, the principle of blank composition for movement, the principle of balanced and stable composition, etc.

Step 102: Obtain image features corresponding to the target image, the image features include first image features and second image features, the first image features are associated with the first image areas corresponding to the plurality of cropping candidate areas, The second image feature is associated with a second image area in the target image other than the first image area.

For the target image, it is necessary to obtain the image features corresponding to the target image. Before acquiring the image features corresponding to the target image, image data processing is first performed on the target image, and a process of image data processing is briefly introduced below. For the target image, bilinear interpolation is used to adjust it to a size of 256×256, and data enhancement is performed, wherein data enhancement may include mirror processing, random rotation processing, Gaussian noise processing, normalization processing, etc.

After processing the image data of the target image, the target image is input into the backbone network (such as MobileNetv2), and the backbone network outputs features of various scales, and the output features are spliced to obtain the image features corresponding to the target image. The image features corresponding to the target image may include first image features and second image features. The first image feature is a feature corresponding to a plurality of cropping candidate areas, and the multiple cropping candidate areas are associated with the first image area of the target image, that is, the first image feature is associated with the first image area corresponding to the multiple cropping candidate areas Features; the second image feature is a feature associated with the second image area, and the second image area is an image area in the target image that is different from the first image area.

Step 103: Input the image features corresponding to the target image into the image evaluation network model, and obtain the feature scores corresponding to the plurality of cropping candidate regions respectively, and the feature scores are used to characterize the aesthetic features and salience of the cropping candidate regions at least one of the features.

After acquiring the image features corresponding to the target image, the image features corresponding to the target image can be input into the image evaluation network model, and the image evaluation network model outputs the feature scores corresponding to the plurality of clipping candidate regions respectively. Wherein, the image evaluation network model is obtained through model training based on the image saliency information of multiple training images and/or the image aesthetic information of multiple training images, and is used for scoring the cropping candidate regions of the image. The feature score of the cropping candidate frame region corresponding to the target image is used to characterize at least one of the aesthetic feature and the salient feature of the cropping candidate region.

Step 104 is executed after the feature scores corresponding to the plurality of clipping candidate regions are obtained based on the image evaluation network model.

Step 104: Determine at least one candidate target cropping area according to the feature scores respectively corresponding to the plurality of candidate cropping areas, and crop the target image according to the candidate target cropping area.

After determining the feature scores corresponding to the plurality of cropping candidate regions, according to the feature scores corresponding to the plurality of cropping candidate regions, at least one target cropping candidate region is determined in the plurality of cropping candidate regions, and based on the determined at least one target cropping The candidate area crops the target image to obtain the cropped image.

In the above implementation process of the present application, by determining a plurality of cropping candidate regions corresponding to the target image, the image features corresponding to the target image including the first image feature associated with the cropping candidate region and the second image feature associated with the non-cropping candidate region are obtained, Input the image features corresponding to the target image into the image evaluation network model, obtain the feature scores of at least one of the representative aesthetic features and salient features corresponding to the multiple cropping candidate regions, and according to the feature scores corresponding to the multiple cropping candidate regions, Determining at least one target cropping candidate area and cropping the target image can efficiently mine information in the image based on aesthetic and/or salient features, so as to ensure the image cropping effect and obtain a cropped image with good image quality. Wherein, step 101 determines a plurality of cropping candidate regions corresponding to the target image, including:

Divide the target image into a grid anchor form;

determining at least one target grid in the target image in the form of grid anchors based on preset composition principles;

For the at least one target grid, respectively expand according to at least one expansion ratio, and determine the plurality of clipping candidate regions.

When determining multiple cropping candidate regions corresponding to the target image, it is necessary to divide the target image into a grid anchor form, such as dividing the target image into H×W small grid blocks, and then based on the preset composition principle, in the grid anchor Identify at least one object grid in the object image of the form. In this embodiment, taking the preset composition principle as the thirds composition principle as an example, the process of determining the cropping candidate region is introduced. After dividing the target image into a grid anchor form, according to the principle of three-point composition, the target image is divided by the intersection of two horizontal lines and two vertical lines, and the target image is divided into nine large grid blocks of the same size (9 grids). ). Determine the small grids that the four three-point lines pass through and all the small grids contained in the large grid in the center of the target image, determine these small grids as the target grid, and use the center of the target grid as the center of the candidate area for cropping.

After the target grid is determined, at least one determined target grid is expanded according to at least one expansion ratio, so as to determine a plurality of cropping candidate regions. When expanding, the target grid center can be expanded according to various expansion ratios to obtain cropping candidate regions. Wherein, since the sizes of each grid are equal, when expanding, the length and width of the grid are actually expanded to obtain the cropping candidate area, that is, the expansion ratio refers to the aspect ratio of the expanded cropping candidate area, by Expanding based on the target grid, the clipping candidate area including the target grid and the neighborhood grid can be delineated on the basis of the target grid.

It should be noted that the upper left corner and the lower right corner of the cropping candidate region obtained by expanding are located at the center of the small grid. In order to balance the number of cropping candidate regions and the integrity of the original image content, it is also necessary to ensure that the area ratio of the cropping candidate regions to the original image is reasonable (for example, the area ratio can be greater than 0.4).

In the above implementation process of the present application, by dividing the target image into grid anchor forms, based on preset composition principles, at least one target grid is determined in the target image in the grid anchor form, and then expanded for the target grid, it can be achieved in Cropping candidate regions are determined on the basis of the target grid.

In an optional embodiment of the present application, the method also includes:

According to at least one of image saliency information and image aesthetic information corresponding to the plurality of training images, model training is performed to obtain the image evaluation network model.

The embodiment of the present application needs to obtain multiple training images, and perform model training based on at least one of image saliency information and image aesthetic information of the multiple training images to obtain an image evaluation network model. Since the image evaluation network model is used to score images, by acquiring the image evaluation network model, feature scores corresponding to multiple cropping candidate regions of the target image can be obtained based on the image evaluation network model.

Wherein, performing model training according to at least one of image saliency information and image aesthetic information corresponding to a plurality of training images to obtain the image evaluation network model includes:

Acquiring the image aesthetic information of a plurality of cropping candidate regions corresponding to each of the training images, the image aesthetic information including labeling scores and prediction scores of the cropping candidate regions;

performing model training according to at least one of the image aesthetic information of the plurality of cropping candidate regions corresponding to the plurality of training images and the image saliency information respectively corresponding to the plurality of training images, and obtaining the image evaluation network model.

When performing model training according to image saliency information of multiple training images and/or image aesthetic information of multiple training images, corresponding cropping candidate regions may be determined for each training image. The process of determining the cropping candidate region corresponding to the training image is the same as the process of determining the cropping candidate region corresponding to the target image, and will not be further elaborated here.

After the corresponding cropping candidate regions are determined for each training image, labeling scores and prediction scores of the corresponding multiple cropping candidate regions may be acquired for each training image. For each training image, the corresponding image saliency information can be obtained. Then perform model training according to the annotation scores and prediction scores of the cropping candidate regions corresponding to the multiple training images and/or the image saliency information corresponding to the multiple training images, and obtain the image evaluation network model through the model training.

Among them, the image aesthetic information of the training image includes the labeling score and prediction score of the cropping candidate region of the training image, and the labeling score is the score obtained by the labeling personnel aesthetically labeling the cropping candidate region of the training image based on their own aesthetic standards. Value, the prediction score is the score obtained by predicting the features of the cropping candidate regions of the training image based on the multi-layer convolutional layer.

In the above-mentioned implementation process of the present application, by acquiring image saliency information of multiple training images and/or image aesthetic information including labeling scores and prediction scores of candidate cropping regions, according to the labeling scores and The prediction score, and/or, the image saliency information of the training image, performs model training, and obtains the image evaluation network model, which can realize at least one of the saliency features and aesthetic features based on the training image, and is used for image evaluation. Image evaluation network model for scoring aesthetic features and/or salient features.

Optionally, the acquiring the image aesthetic information of the multiple cropping candidate regions corresponding to each of the training images includes:

For each of the training images, obtain the screening results obtained by the labeling staff for at least two screenings of the multiple cropping candidate regions corresponding to the training image, and determine according to the screening results that the multiple cropping candidate regions correspond to label score;

For each of the training images, obtain the feature map of the training image, extract the region of interest (region of interest, RoI) feature and discarding region (region of discard, RoD) of the cropping candidate region on the feature map ) features are combined into target features, and prediction scores corresponding to the plurality of clipping candidate regions are obtained according to the target features.

For each training image, when obtaining the labeling scores of multiple cropping candidate regions corresponding to the training image, the screening results obtained by the labeling staff for at least two screenings of the multiple cropping candidate regions corresponding to the training image can be obtained, and then based on Based on the obtained screening results, labeling scores corresponding to multiple cropping candidate regions of the current training image are determined.

The specific process of obtaining the labeling score corresponding to the clipping candidate region is introduced below. First, for each training image, based on the existing benchmark model, the region images corresponding to the multiple cropping candidate regions corresponding to the training image are scored, and for each expansion ratio, the cropping candidate frames with the top N scores are output, and for each Expand the scale, randomly select K cropping candidate boxes from the unselected cropping candidate boxes, and then mark them by the labeler, so as to filter out some cropping candidate boxes based on the benchmark model.

Considering the high similarity of cropping candidate regions, it is more difficult for labelers to directly score (for example, 0 to 5 points). Therefore, for each expansion ratio, the output cropping candidate regions can be combined into a first candidate pool, and the annotators select n (for example, 3 to 5) cropping candidates for each first candidate pool according to the expansion ratio. The values of n and N can be the same or different. After the selection, the selected cropping candidate regions and some randomly mixed cropping candidate regions (unselected partial cropping candidate regions) form the second candidate pool, and then perform secondary selection therefrom to select m (as can is 3 to 5) optimal cropping candidate regions. When screening the candidate cropping regions, the candidate cropping regions may be screened based on region images corresponding to the candidate cropping regions.

After the labeler completes the screening, the electronic device can determine the labeling scores corresponding to the multiple cropping candidate regions according to the screening results. For example, the labeling score corresponding to the cropping candidate region selected twice is 2 points, and The labeling score corresponding to the clipping candidate area is 1 point, and the labeling score corresponding to the cropping candidate area that is not selected is 0 point. It should be noted that, considering aesthetic differences, each training image can be screened by multiple annotators for cropping candidate regions, and then the final screening result is determined based on the selections of multiple annotators.

For each training image, when obtaining the prediction scores of multiple cropping candidate regions corresponding to the training image, the first backbone network can be used to extract features of different scales for the training image, and then perform feature splicing to form a feature map. For example, the output of the 7th layer, 14th layer and the last layer of MobileNetv2 are processed by upsampling and downsampling, and feature splicing is performed on the channel to form a feature map. Then 1×1 convolution is used for channel dimensionality reduction. Apply the region of interest operator (RoIAlign) and discarding region operator (RoDAlign) on the feature map according to the clipping candidate area to extract the RoI feature and RoD feature of the cropping candidate area, and splicing the RoI feature and RoD feature to form the final feature (target feature ), the final feature is sent to the multi-layer convolutional layer to output the prediction score s _score of the clipping candidate area. Among them, the multi-layer convolutional layer is part of the aesthetic evaluation task model.

The above implementation process of this application can determine the labeling scores corresponding to the cropping candidate regions according to the screening results corresponding to the selection of the cropping candidate regions by the labeler, determine the target feature based on the RoI feature and RoD feature of the cropping candidate region, and obtain the cropping based on the target feature. The prediction score of the candidate area can then obtain the image aesthetic information of the training image.

Optionally, perform model training according to at least one of the image aesthetic information of the plurality of cropping candidate regions respectively corresponding to the plurality of training images, and the image saliency information corresponding to the plurality of training images respectively, to obtain The image evaluation network model includes one of the following solutions:

Perform aesthetic evaluation task training according to the labeling scores and prediction scores of the plurality of cropping candidate regions respectively corresponding to the plurality of training images, and determine the aesthetic evaluation task model, where the aesthetic evaluation task model is the image evaluation network model;

Perform saliency task training according to the saliency grayscale images and saliency map prediction results respectively corresponding to a plurality of training images, and determine a saliency task model, where the saliency task model is the image evaluation network model, and the saliency task model is determined. Image saliency information includes saliency grayscale map and saliency map prediction results;

Performing model training to acquire the image evaluation network model according to the annotation scores and prediction scores of the plurality of cropping candidate regions corresponding to the plurality of training images and the image saliency information corresponding to the plurality of training images respectively.

When performing model training according to the image aesthetic information of multiple cropping candidate regions corresponding to multiple training images, the aesthetic evaluation task training can be performed according to the labeling scores and prediction scores of the cropping candidate regions corresponding to multiple training images respectively, To determine the aesthetic evaluation task model, the aesthetic evaluation task model trained at this time is the image evaluation network model.

When performing model training based on the image saliency information corresponding to multiple training images, the saliency task training can be performed according to the saliency grayscale map and saliency map prediction results corresponding to multiple training images to determine the saliency task Model, the saliency task model trained at this time is the image evaluation network model. During this implementation process, the image saliency information includes the saliency grayscale map and the prediction result of the saliency map.

When performing model training according to the image aesthetic information and image saliency information corresponding to multiple training images, the labeling scores and prediction scores of the multiple cropping candidate regions corresponding to the multiple training images and the multiple training images can be respectively The corresponding image saliency information is used for model training, and the image evaluation network model obtained at this time can score the aesthetic features and saliency features of the image.

In the above process, three models can be trained according to different features, so that image scoring can be performed based on any one of the models. For the aesthetic evaluation task model, the aesthetic feature score of the image can be performed; for the saliency task model, the saliency feature score of the image can be scored; for the model trained based on image aesthetic information and image saliency information, the aesthetic feature of the image can be performed and significant feature scores.

The training process of the three models is introduced respectively in the following. For the aesthetic evaluation task model, the aesthetic evaluation task training is carried out according to the labeling scores and prediction scores of multiple cropping candidate regions corresponding to a plurality of the training images, and the aesthetic evaluation task model is determined, including:

For each of the training images, the aesthetic evaluation task loss is determined according to the annotation scores and prediction scores of multiple cropping candidate regions corresponding to the training images;

According to the aesthetic evaluation task loss, update the model parameters of the aesthetic evaluation task model, so as to perform model training to determine the aesthetic evaluation task model.

When performing model training based on training images to determine the aesthetic evaluation task model, for each training image, the aesthetic evaluation task loss can be determined according to the labeling scores and prediction scores of multiple cropping candidate regions corresponding to the training image, based on the aesthetic evaluation task The loss updates the model parameters of the aesthetic evaluation task model. Specifically, after the corresponding aesthetic evaluation task loss is obtained according to the first training image, the model parameters of the aesthetic evaluation task model are updated according to the aesthetic evaluation task loss, and then based on the updated aesthetic evaluation task model based on the model parameters, the corresponding Based on the prediction scores of the multiple cropping candidate regions of the second training image, the corresponding aesthetic evaluation task loss is determined based on the labeling scores and prediction scores of the multiple cropping candidate regions corresponding to the second training image, and the aesthetic evaluation is updated again according to the aesthetic evaluation task loss The model parameters of the task model continue to execute the aesthetic evaluation task model updated based on the model parameters, obtain the prediction scores of multiple cropping candidate regions corresponding to the next training image and determine the aesthetic evaluation task loss, and update the model parameters based on the aesthetic evaluation task loss The process until it is determined that the loss of the aesthetic evaluation task satisfies the preset conditions, and it is determined that the model training is successful.

Among them, the network architecture of the aesthetic evaluation task model can be seen in Figure 2, that is, the aesthetic evaluation task model includes a first backbone network, a multi-layer convolutional layer and a feature acquisition architecture between the two, and the first backbone network is used to acquire The features of different scales of the image are then spliced by the feature acquisition framework to obtain the feature map. Based on the feature map, the RoI feature and RoD feature of the candidate area are extracted and combined into the target feature, and the target feature is input into the multi-layer convolution layer to obtain the image. Prediction scores for multiple crop candidate regions. Among them, the process of updating the model parameters of the aesthetic evaluation task model can be understood as the process of updating the parameters of the first backbone network and the parameters of the multi-layer convolutional layer.

The process of determining the loss of the aesthetic evaluation task according to the labeling scores and prediction scores of the multiple cropping candidate regions corresponding to the training images is described below. For the training image, the labeling scores s _gd of the cropping candidate regions are divided in pairs to obtain the labeling score difference matrix S _gd (the number of rows and columns is equal, which is related to the number of cropping candidate regions, such as 5 cropping candidate regions, then is a 5*5 matrix). Construct the first all-zero matrix according to the marked score difference matrix S _gd , and then modify the first all-zero matrix according to the first principle to obtain the first matrix, wherein the first principle is: diagonal elements Set to 0, set the elements of the same proportion to 1, set the difference element between the optimal clipping candidate area and other clipping candidate areas to 2, and set the element in the first matrix corresponding to 0 in the label score difference matrix S _gd to 0, and finally get an effective The image pair matrix P, namely the first matrix.

For the training image, the prediction score s _score of the clipping candidate area is subtracted in pairs to obtain the prediction score difference matrix S (the number of rows and columns is equal). Construct the second all-zero matrix according to the predicted score difference matrix S, modify the second all-zero matrix according to the second principle to obtain the second matrix G, the second principle is: the elements in the predicted score difference matrix S are greater than 0, then the elements in G Set to 1, otherwise the elements in G are set to -1. The edge matrix M is obtained by G*S _gd .

After determining the annotation score difference matrix S _gd , the first matrix P, the prediction score difference matrix S, the second matrix G, and the edge matrix M, the ranking loss RankLoss and the score loss ScoreLoss can be calculated.

Among them, the ranking loss RankLoss is determined according to the predicted score difference matrix S, the second matrix G, the first matrix P, and the edge matrix M, specifically: calculate the product of S and negative G, and then add it to M, the obtained matrix and Multiply P to get the target matrix, compare the elements in the target matrix with 0, take the maximum value, and update the elements in the target matrix. Then for the updated target matrix, accumulate the elements in the target matrix to obtain the first value, and for the first matrix P, accumulate the elements in the first matrix P to obtain the second value, and calculate the first value and the second value The ratio is the ranking loss RankLoss. The calculation expression of sorting loss can be:

RankLoss=sum(max((S*(-G)+M)*P), 0)/sum(P)

In the above expression -G means that the elements in the second matrix are negative, (S*(-G)+M) means the product of S and negative G, and then added to M, and then the resulting matrix is multiplied by the P matrix, Get the target matrix. (max((S*(-G)+M)*P), 0) indicates that the elements in the target matrix are compared with 0 respectively, and the maximum value is taken to update the elements in the target matrix. sum(max((S*(-G)+M)*P), 0) means to add up each element in the target matrix for the updated target matrix, and sum(P) means to add up each element in the first matrix.

Among them, the score loss ScoreLoss is determined according to the prediction score s _score of the clipping candidate area and the labeling score s _gd of the clipping candidate area. Specifically, it is calculated based on the L1 smoothing loss function. The calculation expression of the score loss can be:

ScoreLoss=SmoothL1Loss(s _score , s _gd ), where SmoothL1Loss is the L1 smoothing loss function.

Among them, s _score is the prediction score of the clipping candidate region, s _gd is the labeling score of the cropping candidate region, and the aesthetic evaluation task loss is the weighted sum of ranking loss and score loss.

The above-mentioned implementation process of the present application can determine the predicted score difference matrix and the labeled score difference matrix based on the predicted score and labeling score of the clipping candidate area, determine the sorting loss based on the two matrices, and determine the ranking loss according to the predicted score and labeling of the cropping candidate area. Calculate the score loss based on the score value, determine the aesthetic evaluation task loss based on the weighted sum of the ranking loss and the score loss, and adjust the parameters based on the aesthetic evaluation task loss to train the aesthetic evaluation task model.

For the saliency task model, performing saliency task training and determining the saliency task model according to the saliency grayscale images and saliency map prediction results respectively corresponding to the plurality of training images, including:

For each of the training images, determining the saliency task loss according to the saliency grayscale map corresponding to the training image and the saliency map prediction result;

According to the saliency task loss, model parameters of the saliency task model are updated to perform model training to determine the saliency task model. When determining the saliency task model, for each training image, the saliency task loss can be determined according to the saliency grayscale image corresponding to the training image and the prediction result of the saliency map, and then the model of the saliency task model can be updated based on the saliency task loss parameter. Specifically: for the first training image, obtain the saliency grayscale image of the first training image (the saliency grayscale image can be determined based on a well-trained large salient object detection (SOD) network model), and then use The initial saliency task model obtains the saliency map prediction result corresponding to the first training image, and determines the saliency task loss of the first training image according to the saliency grayscale map and the saliency map prediction result. The model parameters of the saliency task model are updated according to the saliency task loss, and then the saliency map prediction results corresponding to the second training image are obtained based on the updated saliency task model based on the model parameters, and based on the saliency gray corresponding to the second training image Determine the saliency task loss of the second training image based on the prediction results of the degree map and saliency map, update the saliency task model according to the saliency task loss again, continue to execute the saliency task model updated based on the model parameters, and obtain the next training image corresponding to The process of predicting the results of the saliency map and determining the saliency task loss, and updating the model parameters based on the saliency task loss, until the saliency task loss meets the preset conditions, and the model training is confirmed to be successful.

Among them, the calculation process of the saliency task loss is: SalLoss=BCEWithLogitsLoss(s _pred , s _sod ), BCEWithLogitsLoss is the cross-entropy loss function, SalLoss is the saliency task loss, s _pred is the prediction result of the saliency map, s _sod is the saliency grayscale image.

The network architecture of the saliency task model can be seen in Figure 3, that is, the saliency task model includes the first backbone network, cross-stage convolution and multi-layer convolutional layers (which can be compared with the multi-layer convolutional layers of the aesthetic evaluation task model). difference), the first backbone network is used to obtain different scale features of the image, and use cross-stage convolution for feature fusion. In order to extract multi-scale features at the granular level, each scale of the feature is composed of a set of parallel with different expansion rates Convolution processing, and then generate the highest resolution features through 1 × 1 convolution across stages, and finally output the saliency map prediction results through multiple convolutional layers. Among them, the process of updating the model parameters of the saliency task model can be understood as the process of updating the parameters of the first backbone network, the parameters of the cross-stage convolution and the parameters of the multi-layer convolution layer.

In the above implementation process of this application, on the entire image, the multi-layer features of the first backbone network are fused by using cross-stage convolution to predict the salient regions in the image, and the cross-entropy loss function is used to predict the results and saliency based on the saliency map. Salient task loss is computed on the grayscale image for model training based on the saliency task loss.

Wherein, when the saliency task model is the image evaluation network model, the image features corresponding to the target image are input into the image evaluation network model, and the feature scores corresponding to the plurality of clipping candidate regions are respectively obtained. ,include:

inputting image features corresponding to the target image into the saliency task model, and obtaining saliency feature information corresponding to each pixel of the target image;

For each candidate cropping region of the target image, according to the salient feature information corresponding to the pixels included in the candidate cropping region, the feature score corresponding to the candidate cropping region is determined.

For the case where the saliency task model is an image evaluation network model, when the feature scores corresponding to multiple cropping candidate regions of the target image are obtained based on the saliency task model, the image features corresponding to the target image can be input into the saliency task model, To obtain the salient feature information corresponding to each pixel of the target image. It can be understood that each pixel point of the target image corresponds to a saliency feature value. Then, for each cropping candidate region of the target image, according to the salient feature information corresponding to the pixels included in the cropping candidate region (the salient feature value corresponding to each pixel included), determine the corresponding feature of the cropping candidate region Score, the feature score at this time is the score that characterizes the significant feature. The higher the proportion of pixels with high salient feature values included in the cropping candidate region, the higher the feature score corresponding to the cropping candidate region.

In the above implementation process of the present application, when using the saliency task model to determine the feature score corresponding to the crop candidate area, the saliency feature information corresponding to the target image can be obtained, and the corresponding saliency feature information can be determined based on the saliency feature information of the pixel corresponding to the crop candidate area. The feature score can be used to determine the corresponding feature score based on the position of the cropping candidate area on the target image.

The following is an introduction to the process of model training based on the annotation scores and prediction scores of multiple cropping candidate regions corresponding to multiple training images and the image saliency information corresponding to multiple training images respectively. This process corresponds to two implementation strategies, and the first implementation strategy will be introduced below.

The model training is performed according to the labeling scores and prediction scores of the multiple cropping candidate regions corresponding to the multiple training images and the image saliency information respectively corresponding to the multiple training images, and the image evaluation network is obtained. models, including:

Determining an aesthetic evaluation task model according to the labeling scores and prediction scores of a plurality of cropping candidate regions respectively corresponding to a plurality of training images;

determining a saliency task model according to the saliency grayscale maps and saliency map prediction results respectively corresponding to the plurality of training images, and the image saliency information includes the saliency grayscale map and saliency map prediction results;

Joint training is performed based on the aesthetic evaluation task model and the saliency task model to obtain the image evaluation network model.

When performing model training according to the training images, the aesthetic evaluation task training can be carried out according to the labeling scores and prediction scores of the multiple cropping candidate regions corresponding to the multiple training images, and the aesthetic evaluation task model can be determined; according to the multiple training images respectively The corresponding saliency grayscale map and saliency map prediction results are carried out for saliency task training to determine the saliency task model. At this time, the image saliency information includes the saliency grayscale map and saliency map prediction results; then based on the aesthetic evaluation task The model and the saliency task model are jointly trained to obtain the image evaluation network model.

Through joint learning based on the saliency task model and the aesthetic evaluation task model, both task models can be optimized simultaneously. The saliency task model urges the model to focus on and preserve visually salient regions in the original image, and the aesthetic evaluation task model urges the model to focus on better aesthetic compositions. The joint learning of the two task models can ensure that the crop candidate region with the highest aesthetic evaluation score is selected among all crop candidate regions containing salient features, so that the cropped image has a high aesthetic score while retaining the original salient features of the image.

Among them, after the aesthetic evaluation task model and the saliency task model are determined, the following steps are included when performing joint training based on the aesthetic evaluation task model and the saliency task model to obtain the image evaluation network model:

determining a target loss based on the aesthetic evaluation task loss and the saliency task loss;

According to the target loss, update the model parameters of the aesthetic evaluation task model and the saliency task model to perform model training and obtain the image evaluation network model.

After determining the aesthetic evaluation task model and the saliency task model, the target loss can be determined according to the aesthetic evaluation task loss corresponding to the aesthetic evaluation task model and the saliency task loss corresponding to the saliency task model. Here, the aesthetic evaluation task loss is the final loss corresponding to the aesthetic evaluation task model, and the saliency task loss is the final loss corresponding to the saliency task model.

Target loss loss=SalLoss+αRankLoss+βScoreLoss, where α and β are trade-off parameters, and usually the values of α and β can be 1.

The process of updating the model parameters of the image evaluation network model based on the target loss is to update the parameters of the first backbone network of the saliency task model, the parameters of the cross-stage convolution and the multi-layer convolution layer, and update the parameters of the aesthetic evaluation task model. The process of updating the parameters of the first backbone network and the parameters of the multi-layer convolutional layer. Among them, the saliency task model and the aesthetic evaluation task model can share the first backbone network.

After updating the model parameters of the image evaluation network model, the aesthetic evaluation task loss and the saliency task loss can be re-determined, and then the target loss can be determined, and then continue to update the model parameters based on the target loss, and repeat the above process until the target loss meets the preset The condition determines that the model was trained successfully. After acquiring the image evaluation network model, the image evaluation network model can be used to crop the target image.

The following describes the process of obtaining the image evaluation network model corresponding to the strategy through a specific implementation process, as shown in Figure 4, including the following steps:

Step 401 , for multiple training images, determine multiple cropping candidate regions corresponding to each training image.

Step 402 , for each training image, determine labeling scores of multiple cropping candidate regions corresponding to the training image.

Step 403: Perform image data processing on the training image, and acquire features of different scales of the training image based on the first backbone network. Step 404 and step 405 are respectively executed after step 403 .

Step 404 : Perform aesthetic evaluation task model training based on different scale features of the training image and labeling scores of cropping candidate regions corresponding to the training image.

Step 405: Perform saliency task model training based on different scale features of the training image and the saliency grayscale image of the training image. The saliency grayscale image may be acquired in advance.

Step 406, performing joint training based on the aesthetic evaluation task model and the saliency task model to determine an image evaluation network model.

The above-mentioned process of determining the image evaluation network model based on strategy 1 can use the saliency task as a subtask of joint learning to train the deep network, which can effectively combine the saliency information of the image without increasing the complexity and reasoning performance of the network. , so that the trained image evaluation network model can well output cropped images combining aesthetic and salient features.

The above describes the process of obtaining the image evaluation network model based on the first strategy, and the following describes the process of obtaining the image evaluation network model based on the second strategy. The model training is performed according to the labeling scores and prediction scores of the multiple cropping candidate regions corresponding to the multiple training images and the image saliency information respectively corresponding to the multiple training images, and the image evaluation network is obtained. models, including:

For each of the training images, generating salient image features according to the training image and the salient grayscale image corresponding to the training image, the image salient information includes the salient image features, and the salient image features Including the RoI feature and RoD feature of the clipping candidate region;

For each of the training images, updating the prediction scores of the plurality of cropping candidate regions corresponding to the training images according to the salient image features corresponding to the training images;

Perform model training according to the labeling scores and updated prediction scores of the multiple cropping candidate regions respectively corresponding to the multiple training images, and acquire the image evaluation network model.

When performing model training to obtain an image evaluation network model, for each training image in multiple training images, first obtain the saliency grayscale image of the training image based on a large-scale SOD network model with complete training, and then convert the saliency grayscale image to Input the second backbone network to obtain the salient features of different scales, input the training image into the first backbone network to obtain the different scale features of the training image, and output the salient features of different scales output by the second backbone network and the different scale features output by the first backbone network Splicing is performed to generate salient image features. At this time, the image saliency information includes the salient image features generated by splicing, and the salient image features include the RoI features and RoD features of the training image, and can further include the RoI features and RoD features of the cropping candidate area.

For each training image, when updating the prediction scores of multiple cropping candidate regions corresponding to the current training image according to the salient image features corresponding to the current training image, when obtaining the prediction scores of multiple cropping candidate regions corresponding to the training image, It is necessary to extract the RoI feature and RoD feature of the cropping candidate area, and the salient image features include the RoI feature and RoD feature of the cropping candidate area, so the RoI feature and RoD feature of the cropping candidate area of the training image can be updated based on the salient image features of the training image , and then update the prediction score of the clipping candidate region.

Then, model training is performed according to the annotation scores and updated prediction scores of multiple cropping candidate regions corresponding to multiple training images to obtain an image evaluation network model, and the process of model training will not be described here.

The following describes the process of obtaining the image evaluation network model corresponding to strategy 2 through a specific implementation process, as shown in Figure 5, including the following steps:

Step 501. For multiple training images, determine multiple cropping candidate regions corresponding to each training image.

Step 502, for each training image, determine the labeling scores of multiple cropping candidate regions corresponding to the training image.

Step 503, perform image data processing on the training image. After step 503, step 504 and step 505 are executed respectively.

Step 504, acquiring different scale features of the training image based on the first backbone network.

Step 505: Obtain a saliency grayscale image of the training image, and obtain salient features of different scales corresponding to the saliency grayscale image based on the second backbone network.

Step 506 is executed after step 504 and step 505 .

Step 506 : Concatenate salient features of different scales corresponding to the saliency grayscale image with features of different scales of the training image to obtain salient image features.

Step 507: Input the salient image features of the training images into the aesthetic evaluation task model for model training, and obtain the image evaluation network model.

The above-mentioned process of determining the image evaluation network model based on strategy 2 directly obtains the saliency grayscale image of the original image, and then extracts high-level salient features through the backbone network, splicing the salient features and the features of the original image and sending them into the aesthetic evaluation task model Learning in , can achieve a more direct use of salient features.

In an optional embodiment of the present application, the determining at least one target cropping candidate region according to the feature scores respectively corresponding to the plurality of cropping candidate regions includes:

Among the plurality of feature scores, at least one target feature score greater than a preset score threshold is screened out;

The cropping candidate region corresponding to the target feature score is determined as the target cropping candidate region.

When at least one target cropping candidate region is determined according to multiple feature scores of multiple cropping candidate regions corresponding to the target image, the corresponding feature scores can be sorted from high to low for each expansion ratio, based on the sorting As a result, the target feature score greater than the preset score threshold is screened out, so as to determine at least one target feature score for each expansion ratio, and then the cropping candidate area corresponding to the target feature score is determined as the target cropping candidate area, and the cropping candidate area based on The original image is cropped to obtain a cropped image representing aesthetic and/or salient features.

The above is the image cropping method provided by the embodiment of the present application. By determining a plurality of cropping candidate regions corresponding to the target image, the target image corresponding to the first image feature associated with the cropping candidate region and the second image feature associated with the non-cropping candidate region are obtained. The image feature of the image feature, the image feature corresponding to the target image is input into the image evaluation network model, and the feature score of at least one of the representative aesthetic features and the salient features corresponding to the plurality of cropping candidate regions is obtained, and according to the plurality of cropping candidate regions Respectively corresponding feature scores, determine at least one target cropping candidate area and crop the target image, based on aesthetic and/or salient features, efficiently mine the information in the image to ensure the image cropping effect and obtain good image quality cropping image.

Further, by dividing the target image into the grid anchor form, based on the preset composition principle, at least one target grid is determined in the target image of the grid anchor form, and then expanded for the target grid, the target grid can be The cropping candidate area is determined on the basis of , and the selection of the cropping candidate area is realized.

Image-based saliency can be achieved by obtaining image saliency information of multiple training images and/or image aesthetic information including annotation scores and prediction scores of cropping candidate regions, performing model training based on training images, and obtaining image evaluation network models. sexual features and/or aesthetic features, and obtain an image evaluation network model for scoring aesthetic features and/or salient features for images.

By determining the image evaluation network model based on different methods, the determination method of the image evaluation network model is enriched; by determining the target crop candidate area based on the feature score and then performing image cropping, it is possible to efficiently mine images based on aesthetic and/or salient features. information to ensure image cropping effect.

The image cropping method provided in the embodiment of the present application may be executed by an image cropping device. In the embodiment of the present application, the image cropping device provided in the embodiment of the present application is described by taking the image cropping method performed by the image cropping device as an example.

The embodiment of the present application also provides an image cropping device, as shown in FIG. 6 , including:

A determining module 601, configured to determine a plurality of cropping candidate regions corresponding to the target image;

The first acquiring module 602 is configured to acquire image features corresponding to the target image, where the image features include a first image feature and a second image feature, and the first image feature is the first image feature corresponding to the plurality of cropping candidate regions. an image area association, the second image feature is associated with a second image area in the target image other than the first image area;

The second acquiring module 603 is configured to input the image features corresponding to the target image into the image evaluation network model, and acquire feature scores respectively corresponding to the plurality of cropping candidate regions, and the feature scores are used to characterize the cropping candidate regions at least one of aesthetic and distinctive features;

The processing module 604 is configured to determine at least one candidate target cropping area according to feature scores respectively corresponding to the multiple candidate cropping areas, and crop the target image according to the candidate target cropping areas.

Optionally, the determination module includes:

A division sub-module for dividing the target image into grid anchor forms;

The first determination submodule is used to determine at least one target grid in the target image in the grid anchor form based on preset composition principles;

The second determination sub-module is configured to respectively expand the at least one target grid according to at least one expansion ratio, and determine the plurality of cropping candidate regions.

Optionally, the device also includes:

The training acquisition module is used to perform model training according to at least one of image saliency information and image aesthetic information corresponding to the plurality of training images to acquire the image evaluation network model.

Optionally, the training acquisition module includes:

An acquisition sub-module, configured to acquire the image aesthetic information of multiple cropping candidate regions corresponding to each of the training images, where the image aesthetic information includes labeling scores and prediction scores of the cropping candidate regions;

The training acquisition sub-module is used to perform model training according to at least one of the image aesthetic information of the plurality of cropping candidate regions corresponding to the plurality of training images and the image saliency information corresponding to the plurality of training images respectively. To train, obtain the image evaluation network model.

Optionally, the acquisition submodule includes:

The first processing unit is configured to, for each of the training images, obtain screening results obtained by annotators screening multiple cropping candidate regions corresponding to the training images at least twice, and determine the Marking scores corresponding to multiple cropping candidate regions;

The second processing unit is used to obtain the feature map of the training image for each of the training images, extract the RoI feature and RoD feature of the cropping candidate region on the feature map and combine them into target features, according to the The target features are used to obtain prediction scores corresponding to the plurality of clipping candidate regions.

Optionally, the training acquisition submodule includes one of the following units:

The first training unit is configured to perform aesthetic evaluation task training and determine an aesthetic evaluation task model according to the labeling scores and prediction scores of a plurality of clipping candidate regions respectively corresponding to the plurality of training images, and the aesthetic evaluation task model is the image evaluation network model;

The second training unit is configured to perform saliency task training and determine a saliency task model according to the saliency grayscale images and saliency map prediction results respectively corresponding to the plurality of training images, and the saliency task model is the An image evaluation network model, wherein the image saliency information includes a saliency grayscale map and a saliency map prediction result;

The third training unit is configured to perform model training according to the labeling scores and prediction scores of the plurality of clipping candidate regions respectively corresponding to the plurality of training images and the image saliency information corresponding to the plurality of training images respectively, to obtain The image evaluation network model.

Optionally, when the saliency task model is the image evaluation network model, the second acquisition module is further used for:

Optionally, the first training unit includes:

The first determining subunit is configured to determine an aesthetic evaluation task loss for each of the training images according to the labeling scores and prediction scores of the multiple cropping candidate regions corresponding to the training images;

The first update subunit is configured to update the model parameters of the aesthetic evaluation task model according to the aesthetic evaluation task loss, so as to perform model training to determine the aesthetic evaluation task model.

Optionally, the second training unit includes:

The second determination subunit is configured to determine the saliency task loss for each of the training images according to the saliency grayscale map corresponding to the training image and the saliency map prediction result;

The second updating subunit is configured to update the model parameters of the saliency task model according to the saliency task loss, so as to perform model training to determine the saliency task model.

Optionally, the third training unit includes:

The third determining subunit is used to determine the aesthetic evaluation task model according to the labeling scores and prediction scores of the multiple cropping candidate regions respectively corresponding to the multiple training images;

The fourth determining subunit is used to determine the saliency task model according to the saliency grayscale images and saliency image prediction results respectively corresponding to the plurality of training images, and the image saliency information includes the saliency grayscale image and the saliency image Sex map prediction results;

The first obtaining subunit is configured to perform joint training based on the aesthetic evaluation task model and the saliency task model, and obtain the image evaluation network model.

Optionally, the third training unit includes:

A generation subunit is configured to, for each of the training images, generate salient image features according to the training images and the salient grayscale images corresponding to the training images, the image salient information including the salient image features, And the salient image feature includes the RoI feature and the RoD feature of the clipping candidate region;

A third updating subunit, configured to, for each of the training images, update the prediction scores of the plurality of cropping candidate regions corresponding to the training images according to the salient image features corresponding to the training images;

The second acquisition subunit is configured to perform model training according to the labeling scores and updated prediction scores of the plurality of cropping candidate regions respectively corresponding to the plurality of training images, and acquire the image evaluation network model.

The image cropping device provided in the embodiment of the present application, by determining a plurality of cropping candidate regions corresponding to the target image, obtains the first image feature associated with the cropping candidate region and the second image feature associated with the non-cropping candidate region corresponding to the target image The image features corresponding to the target image are input into the image evaluation network model, and the feature scores of at least one of the representative aesthetic features and salient features corresponding to the multiple cropping candidate regions are obtained. According to the multiple cropping candidate regions corresponding to To determine at least one target cropping candidate region and crop the target image, based on the aesthetic and/or salient features, the information in the image can be efficiently mined to ensure the image cropping effect and obtain a cropped image with good image quality.

The image cropping apparatus in the embodiment of the present application may be an electronic device, or may be a component in the electronic device, such as an integrated circuit or a chip. The electronic device may be a terminal, or other devices other than the terminal. Exemplarily, the electronic device can be a mobile phone, a tablet computer, a notebook computer, a handheld computer, a vehicle electronic device, a mobile Internet device (Mobile Internet Device, MID), an augmented reality (augmented reality, AR)/virtual reality (virtual reality, VR) ) equipment, robots, wearable devices, ultra-mobile personal computer (ultra-mobile personal computer, UMPC), netbook or personal digital assistant (personal digital assistant, PDA), etc., can also serve as server, network attached storage (Network Attached Storage, NAS), personal computer (personal computer, PC), television (television, TV), teller machine, or self-service machine, etc., which are not specifically limited in this embodiment of the present application.

The image cropping device in the embodiment of the present application may be a device with an operating system. The operating system may be an Android operating system, an iOS operating system, or other possible operating systems, which are not specifically limited in this embodiment of the present application.

The image cropping device provided in the embodiment of the present application can realize various processes implemented in the embodiment of the image cropping method shown in FIG. 1 , and details are not repeated here to avoid repetition.

Optionally, as shown in FIG. 7 , the embodiment of the present application further provides an electronic device 700, including a processor 701, a memory 702, and programs or instructions stored in the memory 702 and operable on the processor 701, When the program or instruction is executed by the processor 701, the various processes of the above-mentioned image cropping method embodiment can be achieved, and the same technical effect can be achieved. To avoid repetition, details are not repeated here.

It should be noted that the electronic devices in the embodiments of the present application include the above-mentioned mobile electronic devices and non-mobile electronic devices.

FIG. 8 is a schematic diagram of a hardware structure of an electronic device implementing an embodiment of the present application.

The electronic device 800 includes, but is not limited to: a radio frequency unit 801, a network module 802, an audio output unit 803, an input unit 804, a sensor 805, a display unit 806, a user input unit 807, an interface unit 808, a memory 809, and a processor 810, etc. .

Those skilled in the art can understand that the electronic device 800 can also include a power supply (such as a battery) for supplying power to various components, and the power supply can be logically connected to the processor 810 through the power management system, so as to manage charging, discharging and power consumption through the power management system. Management and other functions. The structure of the electronic device shown in FIG. 8 does not constitute a limitation to the electronic device. The electronic device may include more or fewer components than shown in the figure, or combine some components, or arrange different components, and details will not be repeated here. .

Wherein, the processor 810 is configured to: determine a plurality of cropping candidate regions corresponding to the target image; obtain image features corresponding to the target image, the image features include a first image feature and a second image feature, and the first image feature Associated with the first image area corresponding to the plurality of cropping candidate areas, the second image feature is associated with a second image area in the target image other than the first image area; The image features are input into the image evaluation network model, and the feature scores corresponding to the plurality of cropping candidate regions are obtained, and the feature scores are used to characterize at least one of the aesthetic features and salient features of the cropping candidate regions; according to the The feature scores corresponding to the plurality of cropping candidate regions are used to determine at least one target cropping candidate region, and the target image is cropped according to the target cropping candidate region.

Optionally, when determining a plurality of cropping candidate regions corresponding to the target image, the processor 810 is further configured to: divide the target image into grid anchor forms; based on preset composition principles, the target image in the grid anchor form Determine at least one target grid; for the at least one target grid, respectively expand according to at least one expansion ratio, and determine the plurality of clipping candidate regions.

Optionally, the processor 810 is further configured to: perform model training according to at least one of image saliency information and image aesthetic information corresponding to a plurality of training images to obtain the image evaluation network model.

Optionally, when performing model training to obtain the image evaluation network model according to at least one of image saliency information and image aesthetic information corresponding to multiple training images, the processor 810 is further configured to: obtain each The image aesthetic information of the multiple cropping candidate regions corresponding to the training image, the image aesthetic information includes the labeling score and the prediction score of the cropping candidate regions; according to the multiple cropping corresponding to the multiple training images At least one of the image aesthetic information of the candidate area and the image saliency information respectively corresponding to the plurality of training images is used for model training to obtain the image evaluation network model.

Optionally, when acquiring the image aesthetic information of the plurality of cropping candidate regions corresponding to each of the training images, the processor 810 is further configured to: for each of the training images, acquire The corresponding multiple cropping candidate regions are screened at least twice to obtain the screening results, and according to the screening results, the respective labeling scores corresponding to the multiple cropping candidate regions are determined; for each of the training images, the The feature map of the training image, extracting the RoI features and RoD features of the clipping candidate regions on the feature map and combining them into target features, and obtaining the respective prediction scores corresponding to the multiple cropping candidate regions according to the target features.

Optionally, perform model training according to at least one of the image aesthetic information of the plurality of cropping candidate regions respectively corresponding to the plurality of training images, and the image saliency information corresponding to the plurality of training images respectively, to obtain When the image evaluates the network model, the processor 810 is also configured to perform one of the following solutions: perform aesthetic evaluation task training according to the labeling scores and prediction scores of the multiple cropping candidate regions corresponding to the multiple training images respectively , determine the aesthetic evaluation task model, the aesthetic evaluation task model is the image evaluation network model; perform saliency task training according to the saliency grayscale images and saliency map prediction results respectively corresponding to a plurality of the training images, and determine A saliency task model, the saliency task model is the image evaluation network model, and the image saliency information includes a saliency grayscale map and a saliency map prediction result; according to a plurality of training images respectively corresponding to Clipping the labeling score and prediction score of the candidate area and the image saliency information respectively corresponding to the plurality of training images, performing model training, and obtaining the image evaluation network model.

Optionally, when the saliency task model is the image evaluation network model, the image features corresponding to the target image are input into the image evaluation network model, and the features corresponding to the plurality of clipping candidate regions are respectively obtained. When scoring, the processor 810 is also configured to: input the image features corresponding to the target image into the saliency task model, and obtain the saliency feature information corresponding to each pixel of the target image; for each of the target images The clipping candidate area, according to the salient feature information corresponding to the pixel points included in the clipping candidate area, determines the feature score corresponding to the clipping candidate area.

Optionally, when performing aesthetic evaluation task training and determining the aesthetic evaluation task model according to the labeling scores and prediction scores of the plurality of clipping candidate regions respectively corresponding to the plurality of training images, the processor 810 is further configured to: For each of the training images, an aesthetic evaluation task loss is determined according to the annotation scores and prediction scores of multiple cropping candidate regions corresponding to the training image; according to the aesthetic evaluation task loss, a model of the aesthetic evaluation task model is updated. Parameters for model training to determine the aesthetic evaluation task model.

Optionally, when performing saliency task training and determining a saliency task model according to the saliency grayscale images and saliency map prediction results respectively corresponding to the plurality of training images, the processor 810 is further configured to: for each For the training image, determine the saliency task loss according to the saliency grayscale image corresponding to the training image and the saliency map prediction result; update the model parameters of the saliency task model according to the saliency task loss to perform Model training determines the saliency task model.

Optionally, model training is performed according to the labeling scores and prediction scores of the plurality of cropping candidate regions corresponding to the plurality of training images and the image saliency information corresponding to the plurality of training images respectively, to obtain the When evaluating the image network model, the processor 810 is also configured to: determine the aesthetic evaluation task model according to the labeling scores and prediction scores of the multiple cropping candidate regions corresponding to the multiple training images; determining the saliency task model corresponding to the saliency grayscale image and the saliency map prediction result respectively, and the image saliency information includes the saliency grayscale image and the saliency map prediction result; based on the aesthetic evaluation task model and the The salient task model is jointly trained to obtain the image evaluation network model.

Optionally, model training is performed according to the labeling scores and prediction scores of the plurality of cropping candidate regions corresponding to the plurality of training images and the image saliency information corresponding to the plurality of training images respectively, to obtain the When the image evaluates the network model, the processor 810 is further configured to: for each of the training images, generate salient image features according to the training image and the saliency grayscale image corresponding to the training image, and the image saliency information Include the salient image features, and the salient image features include the RoI features and RoD features of the clipping candidate region; for each of the training images, update the corresponding training images according to the salient image features corresponding to the training images Prediction scores of the plurality of cropping candidate regions; performing model training according to the labeling scores and updated prediction scores of the plurality of cropping candidate regions respectively corresponding to the plurality of training images, and obtaining the image evaluation network model.

In this way, by determining a plurality of cropping candidate regions corresponding to the target image, the image features corresponding to the target image including the first image feature associated with the cropping candidate region and the second image feature associated with the non-cropping candidate region are obtained, and the corresponding target image Input the image features of the image evaluation network model into the image evaluation network model, obtain the feature scores of at least one of the aesthetic features and salient features corresponding to the multiple cropping candidate regions, and determine at least one target according to the feature scores corresponding to the multiple cropping candidate regions Cropping the candidate area and cropping the target image can efficiently mine the information in the image based on aesthetic and/or salient features, so as to ensure the image cropping effect and obtain a cropped image with good image quality.

It should be understood that, in the embodiment of the present application, the input unit 804 may include a graphics processor (Graphics Processing Unit, GPU) 8041 and a microphone 8042, and the graphics processor 8041 is used for the image captured by the image capture device in the video capture mode or the image capture mode. (such as a camera) to process the image data of still pictures or videos. The display unit 806 may include a display panel 8061, and the display panel 8061 may be configured in the form of a liquid crystal display, an organic light emitting diode, or the like. The user input unit 807 includes at least one of a touch panel 8071 and other input devices 8072 . The touch panel 8071 is also called a touch screen. The touch panel 8071 may include two parts, a touch detection device and a touch controller. Other input devices 8072 may include, but are not limited to, physical keyboards, function keys (such as volume control buttons, switch buttons, etc.), trackballs, mice, and joysticks, which will not be repeated here. Memory 809 may be used to store software programs as well as various data, including but not limited to application programs and operating systems. The processor 810 may integrate an application processor and a modem processor, wherein the application processor mainly processes operating systems, user pages, and application programs, and the modem processor mainly processes wireless communications. It can be understood that the foregoing modem processor may not be integrated into the processor 810 .

The memory 809 can be used to store software programs as well as various data. The memory 809 may mainly include a first storage area for storing programs or instructions and a second storage area for storing data, wherein the first storage area may store an operating system, an application program or instructions required by at least one function (such as a sound playing function, image playback function, etc.), etc. Furthermore, memory 809 may include volatile memory or nonvolatile memory, or, memory 809 may include both volatile and nonvolatile memory. Among them, the non-volatile memory can be read-only memory (Read-Only Memory, ROM), programmable read-only memory (Programmable ROM, PROM), erasable programmable read-only memory (Erasable PROM, EPROM), electronically programmable Erase Programmable Read-Only Memory (Electrically EPROM, EEPROM) or Flash. Volatile memory can be random access memory (Random Access Memory, RAM), static random access memory (Static RAM, SRAM), dynamic random access memory (Dynamic RAM, DRAM), synchronous dynamic random access memory (Synchronous DRAM, SDRAM), double data rate synchronous dynamic random access memory (Double Data Rate SDRAM, DDRSDRAM), enhanced synchronous dynamic random access memory (Enhanced SDRAM, ESDRAM), synchronous connection dynamic random access memory (Synch link DRAM , SLDRAM) and Direct Memory Bus Random Access Memory (Direct Rambus RAM, DRRAM). The memory 809 in the embodiment of the present application includes but is not limited to these and any other suitable types of memory.

The processor 810 may include one or more processing units; optionally, the processor 810 integrates an application processor and a modem processor, wherein the application processor mainly processes operations related to the operating system, user interface, and application programs, etc., Modem processors mainly process wireless communication signals, such as baseband processors. It can be understood that the foregoing modem processor may not be integrated into the processor 810 .

The embodiment of the present application also provides a readable storage medium. The readable storage medium stores programs or instructions. When the program or instructions are executed by the processor, the various processes of the above-mentioned image cropping method embodiments can be achieved, and the same To avoid repetition, the technical effects will not be repeated here.

Wherein, the processor is the processor in the electronic device described in the above embodiments. The readable storage medium includes a computer-readable storage medium, such as a computer read-only memory ROM, a random access memory RAM, a magnetic disk or an optical disk, and the like.

The embodiment of the present application further provides a chip, the chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is used to run programs or instructions to implement the above image cropping method embodiment Each process can achieve the same technical effect, so in order to avoid repetition, it will not be repeated here.

It should be understood that the chips mentioned in the embodiments of the present application may also be called system-on-chip, system-on-chip, system-on-a-chip, or system-on-a-chip.

The embodiment of the present application provides a computer program product, the program product is stored in a storage medium, and the program product is executed by at least one processor to implement the various processes in the above image cropping method embodiment, and can achieve the same technical effect , to avoid repetition, it will not be repeated here.

It should be noted that, in this document, the term "comprising", "comprising" or any other variation thereof is intended to cover a non-exclusive inclusion such that a process, method, article or apparatus comprising a set of elements includes not only those elements, It also includes other elements not expressly listed, or elements inherent in the process, method, article, or device. Without further limitations, an element defined by the phrase "comprising a ..." does not preclude the presence of additional identical elements in the process, method, article, or apparatus comprising that element. In addition, it should be pointed out that the scope of the methods and devices in the embodiments of the present application is not limited to performing functions in the order shown or discussed, and may also include performing functions in a substantially simultaneous manner or in reverse order according to the functions involved. Functions are performed, for example, the described methods may be performed in an order different from that described, and various steps may also be added, omitted, or combined. Additionally, features described with reference to certain examples may be combined in other examples.

Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is better implementation. Based on such an understanding, the technical solution of the present application can be embodied in the form of computer software products, which are stored in a storage medium (such as ROM/RAM, magnetic disk, etc.) , optical disc), including several instructions to enable a terminal (which may be a mobile phone, computer, server, or network device, etc.) to execute the methods described in various embodiments of the present application.

The embodiments of the present application have been described above in conjunction with the accompanying drawings, but the present application is not limited to the above-mentioned specific implementations. The above-mentioned specific implementations are only illustrative and not restrictive. Those of ordinary skill in the art will Under the inspiration of this application, without departing from the purpose of this application and the scope of protection of the claims, many forms can also be made, all of which belong to the protection of this application.

Claims

An image cropping method, including:

Determining multiple cropping candidate regions corresponding to the target image;

Acquiring image features corresponding to the target image, where the image features include first image features and second image features, the first image features are associated with the first image areas corresponding to the plurality of cropping candidate areas, and the second A second image feature is associated with a second image area in the target image other than the first image area;

Input the image features corresponding to the target image into the image evaluation network model, and obtain the feature scores corresponding to the plurality of cropping candidate regions respectively, and the feature scores are used to characterize the aesthetic features and salient features of the cropping candidate regions at least one;

Determine at least one candidate target cropping area according to feature scores respectively corresponding to the multiple candidate cropping areas, and crop the target image according to the candidate target cropping areas.
The method according to claim 1, wherein said determining a plurality of cropping candidate regions corresponding to the target image comprises:

Divide the target image into a grid anchor form;

determining at least one target grid in the target image in the form of grid anchors based on preset composition principles;

For the at least one target grid, respectively expand according to at least one expansion ratio, and determine the plurality of clipping candidate regions.
The method according to claim 1, further comprising:

According to at least one of image saliency information and image aesthetic information corresponding to the plurality of training images, model training is performed to obtain the image evaluation network model.
The method according to claim 3, wherein, performing model training according to at least one of image saliency information and image aesthetic information corresponding to a plurality of training images to obtain the image evaluation network model, comprising:

Acquiring the image aesthetic information of a plurality of cropping candidate regions corresponding to each of the training images, the image aesthetic information including labeling scores and prediction scores of the cropping candidate regions;

performing model training according to at least one of the image aesthetic information of the plurality of cropping candidate regions corresponding to the plurality of training images and the image saliency information respectively corresponding to the plurality of training images, and obtaining the image evaluation network model.
The method according to claim 4, wherein said obtaining the image aesthetic information of a plurality of cropping candidate regions corresponding to each of said training images comprises:

For each of the training images, obtain the screening results obtained by the labeling staff for at least two screenings of the multiple cropping candidate regions corresponding to the training image, and determine according to the screening results that the multiple cropping candidate regions correspond to label score;

For each of the training images, obtain the feature map of the training image, extract the RoI features and RoD features of the cropping candidate region on the feature map and combine them into target features, and obtain the multiple features according to the target features. The prediction scores corresponding to the cropping candidate regions respectively.
The method according to claim 5, wherein, according to the image aesthetic information of the plurality of cropping candidate regions corresponding to the plurality of training images, the image saliency information corresponding to the plurality of training images respectively At least one item is to perform model training to obtain the image evaluation network model, including one of the following schemes:

Perform aesthetic evaluation task training according to the labeling scores and prediction scores of the plurality of cropping candidate regions respectively corresponding to the plurality of training images, and determine the aesthetic evaluation task model, where the aesthetic evaluation task model is the image evaluation network model;

Perform saliency task training according to the saliency grayscale images and saliency map prediction results respectively corresponding to a plurality of training images, and determine a saliency task model, where the saliency task model is the image evaluation network model, and the saliency task model is determined. Image saliency information includes saliency grayscale map and saliency map prediction results;

Performing model training to acquire the image evaluation network model according to the annotation scores and prediction scores of the plurality of cropping candidate regions corresponding to the plurality of training images and the image saliency information corresponding to the plurality of training images respectively.
The method according to claim 6, wherein, when the saliency task model is the image evaluation network model, the image features corresponding to the target image are input into the image evaluation network model, and the multiple The feature scores corresponding to the cropping candidate regions, including:

inputting image features corresponding to the target image into the saliency task model, and obtaining saliency feature information corresponding to each pixel of the target image;

For each cropping candidate area of the target image, according to the salient feature information corresponding to the pixels included in the cropping candidate area, determine the feature score corresponding to the cropping candidate area.
The method according to claim 6, wherein the aesthetic evaluation task training is performed according to the labeling scores and prediction scores of the plurality of cropping candidate regions respectively corresponding to the plurality of training images, and the aesthetic evaluation task model is determined, including :

For each of the training images, the aesthetic evaluation task loss is determined according to the annotation scores and prediction scores of multiple cropping candidate regions corresponding to the training images;

According to the aesthetic evaluation task loss, update the model parameters of the aesthetic evaluation task model, so as to perform model training to determine the aesthetic evaluation task model.
The method according to claim 6, wherein, performing saliency task training and determining a saliency task model according to the saliency grayscale images and saliency map prediction results respectively corresponding to a plurality of said training images, comprising:

For each of the training images, determining the saliency task loss according to the saliency grayscale map corresponding to the training image and the saliency map prediction result;

According to the saliency task loss, the model parameters of the saliency task model are updated to perform model training to determine the saliency task model.
The method according to claim 6, wherein, according to the labeling scores and prediction scores of the plurality of cropping candidate regions corresponding to the plurality of training images and the image saliency information corresponding to the plurality of training images respectively , perform model training, and obtain the image evaluation network model, including:

Determining an aesthetic evaluation task model according to the labeling scores and prediction scores of a plurality of cropping candidate regions respectively corresponding to a plurality of training images;

determining a saliency task model according to the saliency grayscale maps and saliency map prediction results respectively corresponding to the plurality of training images, and the image saliency information includes the saliency grayscale map and saliency map prediction results;

Joint training is performed based on the aesthetic evaluation task model and the saliency task model to obtain the image evaluation network model.
The method according to claim 6, wherein, according to the labeling scores and prediction scores of the plurality of cropping candidate regions corresponding to the plurality of training images and the image saliency information corresponding to the plurality of training images respectively , perform model training, and obtain the image evaluation network model, including:

For each of the training images, generating salient image features according to the training image and the salient grayscale image corresponding to the training image, the image salient information includes the salient image features, and the salient image features Including the RoI feature and RoD feature of the clipping candidate region;

For each of the training images, updating the prediction scores of the plurality of cropping candidate regions corresponding to the training images according to the salient image features corresponding to the training images;

Perform model training according to the labeling scores and updated prediction scores of the multiple cropping candidate regions respectively corresponding to the multiple training images, and acquire the image evaluation network model.
An image cropping device, including:

A determining module, configured to determine a plurality of cropping candidate regions corresponding to the target image;

The first acquisition module is configured to acquire image features corresponding to the target image, the image features include first image features and second image features, and the first image features correspond to the first cropping candidate regions. Image area association, the second image feature is associated with a second image area in the target image other than the first image area;

The second acquisition module is configured to input the image features corresponding to the target image into the image evaluation network model, and acquire feature scores respectively corresponding to the plurality of cropping candidate regions, and the feature scores are used to characterize the aesthetics of the cropping candidate regions at least one of characteristics and distinctive features;

A processing module, configured to determine at least one candidate target cropping area according to the feature scores corresponding to the plurality of candidate cropping areas, and crop the target image according to the candidate target cropping area.
An electronic device, which includes a processor and a memory, the memory stores programs or instructions that can run on the processor, and when the programs or instructions are executed by the processor, any of claims 1 to 11 can be realized. A step of the described image cropping method.
A readable storage medium, wherein a program or an instruction is stored on the readable storage medium, and when the program or instruction is executed by a processor, the image cropping method according to any one of claims 1-11 is realized.
A chip, wherein the chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is used to run a program or an instruction to implement the method described in any one of claims 1-11. Image cropping method.
A computer program product, wherein the program product is stored in a non-volatile storage medium, and the program product is executed by at least one processor to implement the image cropping method according to any one of claims 1-11 .
An electronic device, wherein the electronic device is configured to execute the image cropping method according to any one of claims 1-11.