CN112839167B - Image processing method, device, electronic equipment and computer readable medium - Google Patents

Image processing method, device, electronic equipment and computer readable medium

Info

Publication number
CN112839167B
Authority
CN
China
Prior art keywords
image
sub
user
images
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011613934.4A
Other languages
Chinese (zh)
Other versions
CN112839167A (en)
Inventor
尹康
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Oppo Chongqing Intelligent Technology Co Ltd
Original Assignee
Oppo Chongqing Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Oppo Chongqing Intelligent Technology Co Ltd filed Critical Oppo Chongqing Intelligent Technology Co Ltd
Priority to CN202011613934.4A priority Critical patent/CN112839167B/en
Publication of CN112839167A publication Critical patent/CN112839167A/en
Application granted granted Critical
Publication of CN112839167B publication Critical patent/CN112839167B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60Control of cameras or camera modules
    • H04N23/64Computer-aided capture of images, e.g. transfer from script file into camera, check of taken image quality, advice or proposal for image composition or decision on when to take image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4007Scaling of whole images or parts thereof, e.g. expanding or contracting based on interpolation, e.g. bilinear interpolation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/70Denoising; Smoothing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/13Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60Control of cameras or camera modules
    • H04N23/61Control of cameras or camera modules based on recognised objects
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60Control of cameras or camera modules
    • H04N23/61Control of cameras or camera modules based on recognised objects
    • H04N23/611Control of cameras or camera modules based on recognised objects where the recognised objects include parts of the human body
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60Control of cameras or camera modules
    • H04N23/62Control of parameters via user interfaces
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60Control of cameras or camera modules
    • H04N23/63Control of cameras or camera modules by using electronic viewfinders
    • H04N23/631Graphical user interfaces [GUI] specially adapted for controlling image capture or setting capture parameters
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60Control of cameras or camera modules
    • H04N23/63Control of cameras or camera modules by using electronic viewfinders
    • H04N23/631Graphical user interfaces [GUI] specially adapted for controlling image capture or setting capture parameters
    • H04N23/632Graphical user interfaces [GUI] specially adapted for controlling image capture or setting capture parameters for displaying or modifying preview images prior to image capturing, e.g. variety of image resolutions or capturing parameters
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/80Camera processing pipelines; Components thereof
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/262Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20021Dividing image into blocks, subimages or windows
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Library & Information Science (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses an image processing method, an image processing device, electronic equipment and a computer readable medium, and relates to the technical field of images, wherein the method comprises the following steps: acquiring an image to be processed; cutting the image to be processed according to a preset proportion based on the size parameter of the image to be processed to obtain a plurality of sub-images; determining an aesthetic score for each of the sub-images by means of a pre-trained evaluation model; and determining a target sub-image from the plurality of sub-images based on the aesthetic score of each sub-image. This cutting manner produces more candidate images and is not limited to images that must include a target subject, and the aesthetic score evaluates the aesthetic appeal of each sub-image, so searching for the target sub-image over the aesthetic scores of all sub-images makes the pool of reference sub-images richer; moreover, determining the target sub-image with an evaluation model is more reasonable than mechanically evaluating the aesthetic appeal of an image with only a few rules.

Description

Image processing method, device, electronic equipment and computer readable medium
Technical Field
The present invention relates to the field of image technology, and more particularly, to an image processing method, an image processing device, an electronic device, and a computer readable medium.
Background
The current common intelligent picture cropping method is divided into two stages: the first step detects the position of the main subject (such as a main person or a main building) in the picture, and the second step selects, around the detected subject, a sub-region that conforms to a photographic rule (such as the "rule of thirds") and crops it, so as to select a picture with aesthetic appeal. However, the human perception system is quite complex, and a few simple preset rules cannot reflect the aesthetic feeling of the user, so the picture selected by this method is not accurate enough.
Disclosure of Invention
The application provides an image processing method, an image processing device, an electronic device and a computer readable medium, so as to overcome the above defects.
In a first aspect, an embodiment of the present application provides an image processing method, including: acquiring an image to be processed; cutting the image to be processed according to a preset proportion based on the size parameter of the image to be processed to obtain a plurality of sub-images; determining an aesthetic score for each of the sub-images by means of a pre-trained evaluation model; a target sub-image is determined from a plurality of the sub-images based on the aesthetic score of each of the sub-images.
In a second aspect, an embodiment of the present application further provides an image processing apparatus, including: an acquisition unit, a determination unit, an evaluation unit and a processing unit. The acquisition unit is used for acquiring the image to be processed. The determination unit is used for cutting the image to be processed according to a preset proportion based on the size parameter of the image to be processed to obtain a plurality of sub-images. The evaluation unit is used for determining the aesthetic score of each sub-image through a pre-trained evaluation model. The processing unit is used for determining a target sub-image from the plurality of sub-images based on the aesthetic score of each sub-image.
In a third aspect, an embodiment of the present application further provides an electronic device, including: one or more processors; a memory; one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications configured to perform the above-described method.
In a fourth aspect, embodiments of the present application also provide a computer readable storage medium storing program code executable by a processor, the program code when executed by the processor causing the processor to perform the above method.
The image processing method, the device, the electronic equipment and the computer readable medium provided by the application first cut the image, that is, the image to be processed is cut according to a preset proportion based on its size parameter to obtain a plurality of sub-images, and the aesthetic score of each sub-image is determined through a pre-trained evaluation model; compared with extracting a target subject in the image and then cutting based on the position of the subject, the cutting manner of the application produces more images and is not limited to images that must include the target subject. A target sub-image is then determined from the plurality of sub-images based on the aesthetic score of each sub-image. Because the aesthetic scores evaluate the aesthetic appeal of the sub-images, searching for the target sub-image over the aesthetic scores of all sub-images makes the pool of reference sub-images richer, and determining the target sub-image with an evaluation model is more reasonable than mechanically evaluating the aesthetic appeal of an image with only a few rules.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 shows a schematic diagram of an application scenario provided by an embodiment of the present application;
FIG. 2 shows a method flow chart of an image processing method provided by an embodiment of the present application;
FIG. 3 shows a schematic diagram of an image preview interface provided by an embodiment of the present application;
FIG. 4 shows a schematic diagram of image cutting provided by an embodiment of the present application;
FIG. 5 shows a method flow chart of an image processing method provided by another embodiment of the present application;
FIG. 6 shows a schematic diagram of a model training process provided by an embodiment of the present application;
FIG. 7 shows a schematic diagram of candidate regions provided by an embodiment of the present application;
FIG. 8 shows a schematic diagram of coordinates of candidate regions provided by an embodiment of the present application;
FIG. 9 shows a block diagram of an image processing apparatus provided by an embodiment of the present application;
FIG. 10 shows a block diagram of an electronic device provided by an embodiment of the present application;
FIG. 11 shows a storage unit provided by an embodiment of the present application for storing or carrying program code implementing the image processing method.
Detailed Description
In order to enable those skilled in the art to better understand the present application, the following description will make clear and complete descriptions of the technical solutions in the embodiments of the present application with reference to the accompanying drawings in the embodiments of the present application.
With the increasing availability of mobile devices such as mobile phones, recording the little moments of life by taking pictures anytime has become a daily habit for ordinary people. However, due to a lack of professional photographic knowledge and skills, pictures taken by ordinary people often have defects in factors such as color and composition, so more time is generally required for post-processing to meet social sharing needs. In this context, intelligent cropping algorithms for pictures have drawn the industry's interest; their task can be summarized as: an original picture is input, the optimal sub-region is output according to a certain measurement rule, and the other parts are cut off.
The current common intelligent picture cropping method is divided into two stages: the first step detects the positions of the main subjects (such as main persons and main buildings) in the picture, and the second step selects, around the detected subjects, sub-regions meeting photographic rules (such as the "rule of thirds") for cropping.
Specifically, the target subject within the image is identified and classified; the target subject may be acquired using a target detection algorithm or a target extraction algorithm. For example, all contour line information in the image acquired by the image acquisition device is extracted through a target extraction or clustering algorithm, and the category of the object corresponding to each contour line is then looked up in a pre-learned model, where the pre-learned model uses a matching database that stores a plurality of pieces of contour line information and the category corresponding to each piece, the categories including a human body, an animal, a mountain, a river, a lake surface, a building, a road and the like.
For example, when the target subject is an animal, the outline of the target subject and characteristic information such as ears, horns and limbs can be collected. When the target subject is a human body, face feature extraction can be performed on the target subject, where the face feature extraction method may include a knowledge-based characterization algorithm, or a characterization method based on algebraic features or statistical learning. In addition, when the target subject is a broad landscape such as a lake, a mountain or a grassland, it is possible to determine whether the target subject has a long horizontal line, that is, a horizon; if the horizon is present, the landscape is judged to be broad. The horizon can be detected by collecting all horizontal lines with the contour extraction method and then selecting the line fitted by a relatively concentrated cluster of horizontal lines as the horizon, thereby detecting the broad landscape. Of course, whether the target subject is a landscape may also be determined based on color; for example, when a relatively concentrated region of green or earthy yellow is detected, the target subject is judged to be a wide landscape. Similarly, other objects such as rivers, buildings and roads can also be detected by the above detection algorithms, which will not be described here again.
After the target subject is determined in the image acquired from the image acquisition device, a composition manner is determined according to the target subject, specifically, a category of the target subject may be determined, and then the composition manner is determined according to the category of the target subject. Specifically, the corresponding relation between the category and the composition mode is preset, and the corresponding relation comprises a plurality of categories and the composition mode corresponding to each category.
For example, a rule-of-thirds composition is adopted for a wide landscape, a symmetrical composition is adopted for a building, a diagonal composition is adopted for dynamic objects (people and animals), an S-shaped composition is adopted for a path or a river, and the like. Several further composition manners such as vertical composition and X-shaped composition can also be included. Specifically, once the composition manners corresponding to different types of target subjects are determined in advance, the composition auxiliary pattern corresponding to each composition manner and its display position in the image preview interface are determined, so that reasonable composition guidance can be given to the user.
However, the inventors found in their research that the above method has two major problems: first, the most aesthetic sub-region in a picture does not necessarily contain a so-called subject; second, photographic rules are objective standards summarized by photographers, whereas so-called aesthetic feeling is a subjective perception, and there is no strict correspondence between the two. As a result, the above method simply cuts out the so-called aesthetic image according to preset rules, so the extracted image is too stylized and mechanized to truly satisfy the aesthetic demands of the user.
Therefore, in order to overcome the above drawbacks, the embodiments of the present application provide an image processing method, apparatus, electronic device and computer readable medium, which determine a target sub-image from a plurality of sub-images based on the aesthetic score of each sub-image. This not only makes the pool of reference sub-images richer, but also determines the target sub-image with an evaluation model, which is more reasonable than mechanically evaluating the aesthetic appeal of an image with only a few rules.
In order to facilitate understanding of the embodiments of the present application, application scenarios that may be used in the embodiments of the present application are described first, as shown in fig. 1, fig. 1 shows an application scenario provided in the embodiments of the present application, where a server 10 and a user terminal 20 are located in a wireless network or a wired network, and data interaction between the server 10 and the user terminal 20 is enabled.
In some embodiments, when the user logs in at the user terminal 20 through an account, all information corresponding to the account may be stored in the storage space of the server 10. The server 10 may be a single server, a server cluster, a local server, or a cloud server.
Specifically, the user terminal 20 may be a terminal used by a user, through which the user browses images, and may also serve as a device for capturing images; in some embodiments, an image capturing apparatus is disposed in the user terminal. The server 10 may store pictures from the user terminal 20. In some embodiments, the server 10 may be configured to train the models or algorithms involved in the embodiments of the present application, and the server 10 may also migrate the trained models or algorithms to the user terminal; of course, the user terminal 20 may directly train the models or algorithms involved in the embodiments of the present application. The execution subject of each method step in the embodiments of the present application is not limited.
Referring to fig. 2, fig. 2 shows an image processing method provided in an embodiment of the present application, where an execution body of the method may be the server or the user terminal, and specifically the method includes: s201 to S204.
S201: and acquiring an image to be processed.
As an embodiment, the image to be processed may be an image acquired by the user terminal within the camera application. Specifically, the user uses the camera application in the user terminal to collect the image, and after the shooting of the image is completed, the shot image is used as the image to be processed, so that when the shooting of the user is completed, a reasonable aesthetic image can be cut out for the user automatically according to the shot image of the user.
As another embodiment, the image to be processed may also be a preview image of the camera application of the user terminal, that is, an image collected by the camera of the user terminal for display in the preview interface of the camera application. As shown in fig. 3, fig. 3 shows a schematic diagram of an image preview interface provided in an embodiment of the present application. Specifically, the image preview interface corresponds to a camera application installed in the mobile terminal for starting a front camera or a rear camera and controlling the started camera to take a picture; when the user starts the camera application, the image preview interface is displayed on the screen of the mobile terminal. As shown in fig. 3, the image preview interface is used to display a preview image captured by a camera, which may be an image captured by a first camera or a second camera, and the user can adjust camera parameters within the interface of the camera application, including focal length, filters, ISO values, EV values and other parameters.
In addition, the user may use the image preview interface as a viewfinder of the camera, and place the object to be photographed in the viewfinder, that is, observe the object to be photographed in the preview image displayed on the image preview interface. Then, the user presses a photographing button or otherwise inputs a photographing instruction, and the camera captures a picture of the object to be photographed.
S202: and cutting the image to be processed according to a preset proportion based on the size parameter of the image to be processed to obtain a plurality of sub-images.
The size parameter of the image to be processed may be its image size; for example, if the resolution of the image to be processed is W×H, the size parameters are W and H. As an implementation manner, the image to be processed is cut according to a preset proportion based on its size parameter, where the preset proportion can be set according to actual requirements. In some embodiments, the number X of sub-images to be acquired can be predetermined, and the image to be processed is then cut according to the number X; assuming X is 2, the sizes of the two obtained images are each (W/2)×H.
As an embodiment, all the sub-images together cover the entire area of the image to be processed. Specifically, each sub-image corresponds to a sub-image region within the image to be processed, and all the sub-image regions constitute the entire region of the image to be processed. For example, as shown in fig. 4, four broken lines cut the image to be processed into 8 pictures. The sub-regions in the image to be processed corresponding to these 8 images together form the total image area of the image to be processed, so the image data of any pixel unit of the image to be processed is located in at least one sub-image, and the plurality of sub-images can cover the image to be processed as a whole. In some embodiments, each sub-image may be an equal-division region of the image to be processed, i.e., the size of each sub-image is the same. As another example, the size of the image to be processed is 1280×720 pixels, the number of candidate regions (i.e., the number of sub-images) is limited to a fixed value M, and each candidate region corresponds to a fixed sub-image of the image to be processed (for example, limiting M=2, the 1st candidate region corresponds to the leftmost 720×720 region of the image to be processed, and the 2nd candidate region corresponds to the rightmost 720×720 region of the image to be processed).
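As an illustration of this cutting step, the following Python sketch generates a fixed number of square sub-images of side min(W, H), spaced evenly along the width so that together they cover the image to be processed (for the 1280×720 example above with M = 2, this yields exactly the leftmost and rightmost 720×720 regions). The function name and the assumption that the image is wider than it is tall are illustrative only, not part of the original disclosure.

```python
import numpy as np

def square_candidate_crops(image: np.ndarray, num_regions: int) -> list:
    """Cut an image (H x W x C array) into num_regions square sub-images of
    side min(H, W), evenly spread along the width so they jointly cover the
    whole image. Assumes W >= H (landscape orientation)."""
    h, w = image.shape[:2]
    side = min(h, w)
    if num_regions == 1:
        return [image[:side, :side]]
    crops = []
    for i in range(num_regions):
        # slide the left edge evenly from 0 to (w - side)
        x0 = round(i * (w - side) / (num_regions - 1))
        crops.append(image[:side, x0:x0 + side])  # each crop is a sub-matrix of the pixel matrix
    return crops
```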
As another implementation manner, the image to be processed may be cut into multiple sub-images by using the default box generation rules in the object detection framework SSD.
In addition, the above-mentioned proportional cutting may be an equal-proportion cutting, or it may be an unequal-proportion cutting, for example using dense sampling to approximately traverse the original image so as to cut the image to be processed into a plurality of sub-images.
In addition, it should be noted that the above-mentioned sub-image may be an image having an image format, or may be sampled sub-image data of the image data corresponding to the image to be processed. For example, the image data of the image to be processed is a matrix L, where each element of L is the pixel value of the corresponding pixel point of the image to be processed, and each sub-image may be denoted Ri, where i indexes the sub-images and M is the number of sub-images. Each sub-image Ri represents a sub-matrix of the matrix L formed by a portion of the pixels within L.
As an implementation manner, a sub-image does not necessarily include the target subject or a complete image of the target subject. The target subject may be an object in the image to be processed, for example a person, a building or an animal, and the image of the target subject may be distributed across different sub-images. That is, the image cutting manner provided in the embodiments of the present application is not limited to cutting that must take the target subject as its object, so the cut image does not necessarily include the target subject or the complete target subject, unlike cutting that first takes the target subject as the target and then clips the image around it, which necessarily includes the target subject.
S203: the aesthetic score of each of the sub-images is determined by a pre-trained evaluation model.
Wherein the aesthetic score corresponds to the aesthetic perception of the image by the user, which may include, in particular, the perception of composition, hue, light, etc. As an embodiment, the evaluation model may be trained in advance such that the trained model is able to determine the score of each sub-image, i.e. the aesthetic score, based on the perception of the user. Specifically, the evaluation model may be trained based on a sample image, where the sample image is a sample marked by a user, that is, the user marks the image based on his own perception, so that the trained evaluation model can obtain the aesthetic score of the sub-image based on the user's perception.
As an embodiment, the evaluation model may be a neural network (NN) model, or may be a convolutional neural network (Convolutional Neural Networks, CNN) model based on deep learning. The specific model type is not limited here.
S204: a target sub-image is determined from a plurality of the sub-images based on the aesthetic score of each of the sub-images.
As one embodiment, the higher the aesthetic score of a sub-image, the better the sub-image matches the aesthetic perception of the user. For example, assume the aesthetic perception of a user corresponds to an aesthetic coefficient: the higher the aesthetic coefficient of a thing, the higher the user's aesthetic preference for that thing. The aesthetic coefficient is related to the user's aesthetic view or view of the value of the thing, and each person's aesthetic perception is different; therefore, the evaluation model may be specific to a particular user, i.e., the evaluation models for different users may differ. As one embodiment, the higher the aesthetic score, the higher the aesthetic coefficient, and the lower the aesthetic score, the lower the aesthetic coefficient. In some embodiments, determining the target sub-image from the plurality of sub-images based on the aesthetic score of each sub-image may be performed by taking the sub-image with the highest aesthetic score as the target sub-image. In other embodiments, it may be performed by ordering all sub-images from high to low aesthetic score to obtain a sequence, and taking the first number of sub-images at the top of the sequence as target sub-images.
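A minimal sketch of this selection step, assuming the aesthetic scores are available as a list or array: with top_k = 1 it returns the single sub-image with the highest score, and with a larger top_k it returns the first several sub-images after sorting the scores from high to low.

```python
import numpy as np

def select_target_sub_images(sub_images, aesthetic_scores, top_k: int = 1):
    """Return the top_k sub-images ranked by aesthetic score, highest first."""
    order = np.argsort(np.asarray(aesthetic_scores))[::-1]  # indices sorted by descending score
    return [sub_images[i] for i in order[:top_k]]
```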
Referring to fig. 5, fig. 5 shows an image processing method provided in an embodiment of the present application, where an execution body of the method may be the server or the user terminal, and specifically the method includes: s501 to S507.
S501: a first set of sample images is acquired, each first sample image within the first set of sample images corresponding to a first scoring value, the first scoring value being used to characterize the aesthetic appeal of the user.
As an embodiment, the first sample image set may be composed of sample images that the user has previously annotated; specifically, the user sets a first score value for each first sample image in advance, and the plurality of first sample images marked with first score values are then taken as the first sample image set. The aesthetic feeling of different users is different, so the first sample image sets corresponding to different users are different; therefore, a first sample image set corresponds to the identity information of a user, and the identity information of the user may be the terminal identification of the user terminal used by the user, or biometric information such as the user's fingerprint or facial features. The terminal identification of the user terminal may be a processor identification of the user terminal, a MAC address, or another unique identification that can be used to characterize the identity of the user terminal.
As an embodiment, the objective of the evaluation model is to assign aesthetic scores to sub-images, and the first sample image set may also be a plurality of sub-images obtained after cutting at least one image. Because the number of sub-images obtained by cutting one image may be excessive, having the user mark each sub-image would increase the user's workload. Therefore, when determining the first sample image set corresponding to a target user, it may first be determined whether the target user has a reference user who corresponds to a first sample image set; if so, the first sample image set of the reference user is taken as the first sample image set corresponding to the target user.
Specifically, the target user and the reference user satisfy a preset condition. In some embodiments, the preset condition may be that the intimacy degree of the target user and the reference user is greater than a specified value, in other embodiments, the preset condition may be that the user relationship of the target user and the reference user is a specified relationship, and the specified relationship may be a couple relationship, a lover relationship or an intimacy friend relationship, and then the intimacy friend relationship also indicates that the intimacy degree is greater than the specified value.
The server records the friend relation of the user, wherein the friend relation of the user comprises user identifications of each user belonging to the friend relation with the user. The intimacy degree between the user and the friend can reflect the interaction frequency between the user and the friend and the intimacy of the friend relationship, which is equivalent to classification of the social relationship of the user, specifically, the intimacy degree can be a parameter value, and the larger the parameter value is, the higher the intimacy degree between the user and the friend is, and the higher the interactivity is.
As one implementation, the degree of intimacy between a user and various friends may be determined by the user's social information. The social information of the user comprises interaction information of the user and each friend, wherein the interaction information comprises forwarding times, collection times, comment times and grouping information, specifically, the forwarding times, collection times and comment times can be times of forwarding, collecting and comment of the content posted by the friend by the user, and the grouping information can be keywords of a plurality of groups established by the user and user identifiers in the groups.
As an implementation manner, parameters may be set for the forwarding times, the collection times, the comment times and the grouping information, that is, the parameters include a first parameter, a second parameter, a third parameter and a fourth parameter, where the first parameter corresponds to the forwarding times, the second parameter corresponds to the collection times, the third parameter corresponds to the comment times, and the fourth parameter corresponds to the grouping information. For ease of calculation, the first parameter, the second parameter, the third parameter, and the fourth parameter may all be normalized to the [0,1] value range interval.
Specifically, the forwarding times of the user to each user are obtained, namely the forwarding times corresponding to each friend are determined, then the forwarding times of all friends are added to obtain the total forwarding times, and then the forwarding times of each friend are divided by the total forwarding times to obtain a numerical value which is a first parameter of each friend. Similarly, the second parameter and the third parameter may be obtained.
The fourth parameter may be obtained by determining a grouping of the user on each friend, determining a keyword of each grouping, determining a category corresponding to the keyword of the grouping, and determining a first value corresponding to the keyword of the grouping according to scores corresponding to different preset categories, where the scores corresponding to the different categories are different, and some categories, such as a friend, have a higher set score, and for friends not grouped, the score of the default grouping is lower, so that the first value can represent whether the grouping in which the friend is located is a grouping corresponding to the friend with higher intimacy of the user. The first value is normalized and then used as a fourth parameter.
Then, the intimacy degree between the user and each friend is obtained according to the first parameter, the second parameter, the third parameter and the fourth parameter, and as an implementation mode, the first parameter, the second parameter, the third parameter and the fourth parameter can be summed, and the summed result is used as the intimacy degree.
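The computation described above can be sketched as follows; the function and parameter names are illustrative, and the normalization of the grouping score to [0, 1] is an assumption, since the text only states that all four parameters are normalized before being summed.

```python
def intimacy_degree(forward_cnt: int, collect_cnt: int, comment_cnt: int,
                    group_score: float, total_forward: int, total_collect: int,
                    total_comment: int, max_group_score: float) -> float:
    """Intimacy between the user and one friend: the sum of the four
    normalized parameters (forwarding, collection, comment, grouping)."""
    p1 = forward_cnt / total_forward if total_forward else 0.0      # first parameter
    p2 = collect_cnt / total_collect if total_collect else 0.0      # second parameter
    p3 = comment_cnt / total_comment if total_comment else 0.0      # third parameter
    p4 = group_score / max_group_score if max_group_score else 0.0  # fourth parameter
    return p1 + p2 + p3 + p4
```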
The reference user meeting the preset condition with the target user is found, and the first sample image set of the reference user is used as the first sample image set of the target user, so that the labor cost consumed in generating the sample can be reduced.
As another embodiment, the aesthetic model may be trained in advance, and the first sample image may be scored based on the trained aesthetic model, so that the number of sample images in the first sample image set may be rapidly expanded, and the labor cost consumed in generating the sample may be reduced. Specifically, the embodiment of obtaining the first sample image set may be that a plurality of first sample images are obtained by cutting an original image; and determining a scoring value of each first sample image according to a pre-trained aesthetic model to obtain the first sample image set, wherein the aesthetic model is trained based on a second sample image set, and each second sample image in the second sample image set corresponds to a second scoring value which is a labeling result of a user on the image based on aesthetic cognition. The embodiment of cutting the original image to obtain a plurality of first sample images may refer to the foregoing embodiment of obtaining a plurality of sub-images based on the size parameter of the image to be processed and cutting the image to be processed according to a preset ratio, which is not described herein again. The aesthetic model may be the same network structure as the evaluation model that imparts an aesthetic score to the image, e.g., both the aesthetic model and the evaluation model may be CNN networks.
Referring to fig. 6, as one implementation, this example describes the training process of the aesthetic and evaluation models based on fig. 6. Specifically, as shown in fig. 6, the aesthetic CNN denotes the aesthetic model, the cropping CNN is the evaluation model, the annotated aesthetic dataset is the second sample image set, the unannotated cropping dataset is the plurality of first sample images obtained by cutting the original images, and the annotated cropping dataset is the plurality of first sample images with corresponding first score values, i.e., the first sample image set.
As shown in fig. 6, a second sample image set is acquired; specifically, a large number of second sample images are collected and then marked by the user, that is, a second score is given to each second sample image. The user can score the second sample images based on his or her own subjective aesthetic feeling, so as to configure a second score for each second sample image. If a dataset cannot be self-built due to time or cost constraints, the public dataset AVA may be used. As an embodiment, the manner of determining the second sample image set may refer to the manner of determining the reference user corresponding to the target user described above, and the second sample image set of the reference user is then taken as the second sample image set of the target user.
A preprocessing operation is performed on the second sample images configured with second scores to obtain the second sample image set. Specifically, first, a score normalization operation is performed, that is, all second scores are linearly scaled to between 0 and 10, so that the lowest score is 0 and the highest score is 10. Second, scale normalization of the second sample images is performed: all second sample pictures are scaled to a first specified size using bilinear interpolation. The first specified size may be set based on actual use requirements; for example, it may be 224×224 pixels. Finally, a specified-format storage operation is performed; specifically, the processed second sample images and the corresponding second scores are converted into a specified format for convenient reading. The specified format may be a binary format.
Then, a second predicted value for each second sample image is determined by the aesthetic model to be trained. Specifically, the second image samples are input into the aesthetic model to be trained, the aesthetic model to be trained outputs a second predicted value corresponding to each second sample image, and the aesthetic model to be trained is trained based on the loss between the second predicted value and the second score value of each second sample image to obtain the trained aesthetic model. The loss between the second predicted value and the second score value may be a regression loss between them, i.e., the deviation between the second predicted value and the second score value; the larger the deviation between the two, the lower the similarity between the output of the aesthetic model and the aesthetic perception of the user. The network is iteratively optimized based on stochastic gradient descent, and after training converges, the network parameters are fixed and used as the trained aesthetic model, which can then be applied to the training of the evaluation model. Because the process of manually creating a cropping dataset is too cumbersome, only enough unannotated data needs to be collected, and the cropping dataset is then automatically annotated through the aesthetic model.
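A hedged sketch of this training loop in PyTorch: `model` stands for any CNN that maps a 224×224 image to a single scalar score, the data loader is assumed to yield batches of preprocessed images with second score values already rescaled to [0, 10], and mean squared error is used as one concrete choice of regression loss (the text only requires a regression loss between the predicted and annotated scores).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_aesthetic_model(model: nn.Module, loader, epochs: int = 10, lr: float = 1e-3) -> nn.Module:
    """Train the aesthetic CNN by stochastic gradient descent on a
    regression loss between predicted and annotated scores."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for images, scores in loader:          # images: (B, 3, 224, 224), scores in [0, 10]
            pred = model(images).squeeze(-1)   # second predicted value per image
            loss = F.mse_loss(pred, scores)    # deviation between prediction and annotation
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model                               # parameters are then frozen for labeling
```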
Specifically, cutting out the original image to obtain a plurality of first sample images; determining a first grading value of each first sample image according to a pre-trained aesthetic model to obtain the first sample image set, namely a plurality of first sample images corresponding to the first grading values are collectively called as a first sample image set, wherein the aesthetic model is trained based on a second sample image set, each second sample image in the second sample image set is corresponding to a second grading value, and the second grading value is a labeling result of a user on the image based on aesthetic perception.
In this embodiment, there are a plurality of original pictures, and the original pictures are unannotated images, i.e., they are not set with a score value corresponding to the user's aesthetic feeling. As shown in fig. 7, a plurality of candidate areas may be determined on the original image, each candidate area corresponding to one first sample image. As an embodiment, the original image is preprocessed before the plurality of candidate areas corresponding to the original image are determined; specifically, a size normalization operation is performed on the original image, and all images are scaled to a second specified size using bilinear interpolation, where the second specified size may be set based on actual use requirements. For example, the second specified size may be 320×320 pixels.
S502: and determining a first predicted value of each first sample image through an evaluation model to be trained.
S503: training the evaluation model to be trained based on the loss between the first predicted value and the first scoring value of each of the first sample images to obtain a trained evaluation model.
A scoring value for each of the first sample images is determined based on the pre-trained aesthetic model to obtain a first score value for each first sample image; the first sample images, each with a corresponding first score value, form the first sample image set. Specifically, all candidate regions, i.e., the first sample images, are numbered 1, 2, 3, …, N respectively, where N is a positive integer; the scores s1, s2, s3, …, sN of the first sample images are predicted by the trained aesthetic model and then concatenated into an aesthetic vector S = (s1, s2, s3, …, sN), which serves as the annotation information of the corresponding original image, i.e., as the first score value of each first sample image.
Then, a first predicted value of each first sample image is determined by the evaluation model to be trained, and the evaluation model to be trained is trained based on the loss between the first predicted value and the first score value of each first sample image to obtain the trained evaluation model. Specifically, taking the cross entropy loss between the network prediction vector (namely the first predicted value of each candidate region) and the aesthetic vector S of the original image as the optimization target, the network is iteratively optimized based on stochastic gradient descent, and after training converges the network parameters are fixed and used as the evaluation model.
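The following PyTorch sketch illustrates this second stage under stated assumptions: `crop_model` maps a 320×320 image to N scores, one per fixed candidate region; the loader yields the full image together with its N candidate crops resized for the frozen aesthetic model; and, because the text does not specify how the cross entropy between two score vectors is formed, both the aesthetic vector S and the prediction are converted to distributions with a softmax here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_evaluation_model(crop_model: nn.Module, aesthetic_model: nn.Module,
                           loader, epochs: int = 10, lr: float = 1e-3) -> nn.Module:
    """Train the cropping (evaluation) CNN against labels produced by the
    frozen, pre-trained aesthetic CNN."""
    aesthetic_model.eval()
    optimizer = torch.optim.SGD(crop_model.parameters(), lr=lr)
    for _ in range(epochs):
        for full_images, candidate_crops in loader:      # crops: (B, N, 3, 224, 224)
            b, n = candidate_crops.shape[:2]
            with torch.no_grad():                         # aesthetic vector S as the label
                s = aesthetic_model(candidate_crops.flatten(0, 1)).view(b, n)
            pred = crop_model(full_images)                # first predicted value per region
            loss = -(F.softmax(s, dim=1) * F.log_softmax(pred, dim=1)).sum(dim=1).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return crop_model
```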
As an embodiment, the trained evaluation model may also perform the above-described cutting operation on the image, specifically, input an image to be processed to the evaluation model, the evaluation model may cut out the image to be processed to obtain a plurality of sub-images, and determine the aesthetic score of each sub-image.
S504: and acquiring an image to be processed.
In one embodiment, after the image to be processed is acquired, size processing is performed on it, and the resized image is used as the image to be processed for this round of target sub-image determination. Specifically, the image to be processed is scaled to a third specified size by bilinear interpolation based on its size parameter; in this embodiment of the application, the third specified size may be set according to practical use, for example, 320×320 pixels.
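A one-line illustration of this size processing step, assuming Pillow is used for the bilinear interpolation; the 320×320 default corresponds to the third specified size mentioned above.

```python
import numpy as np
from PIL import Image

def resize_to_spec(image: np.ndarray, spec: int = 320) -> np.ndarray:
    """Scale the image to be processed to spec x spec with bilinear interpolation."""
    return np.asarray(Image.fromarray(image).resize((spec, spec), Image.BILINEAR))
```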
S505: and cutting the image to be processed according to a preset proportion based on the size parameter of the image to be processed to obtain a plurality of sub-images.
S506: the aesthetic score of each of the sub-images is determined by a pre-trained evaluation model.
The image to be processed is input into the evaluation model, the evaluation model determines a plurality of candidate areas, the image data corresponding to each candidate area is taken as a sub-image, and the aesthetic score of each sub-image is obtained, yielding a scoring vector P = (P1, P2, P3, …, PN), where P1 corresponds to the aesthetic score of the sub-image numbered 1 and P2, P3, …, PN have the same meaning for the other sub-images. Each sub-image corresponds to a position coordinate, which may be set according to a preset candidate region coordinate generation manner. Specifically, assuming that the coordinates are (x, y, w, h), as shown in fig. 8, the preset candidate region coordinate generation manner is as follows: x and y in the position coordinates are the coordinates of a vertex of the candidate region corresponding to the sub-image (as an embodiment, the pixel coordinates of the upper-left corner vertex), w is the width of the candidate region, and h is the height of the candidate region.
S507: a target sub-image is determined from a plurality of the sub-images based on the aesthetic score of each of the sub-images.
As one embodiment, the sub-image with the highest aesthetic score among the plurality of sub-images is taken as the target sub-image. The index corresponding to the maximum value is recorded as k, the coordinates (xk, yk, wk, hk) of the k-th candidate region are output according to the preset candidate region generation manner, the candidate region corresponding to the sub-image with the highest aesthetic score is marked as the optimal sub-region, and the image corresponding to the optimal sub-region is taken as the target sub-image.
Then, determining coordinates of a target sub-image in the image to be processed; and clipping the target sub-image based on the coordinates of the target sub-image to obtain a clipped image from the image to be processed, wherein the clipped image is used as a designated image, and the designated image is an image which is finally extracted from the image to be processed and meets the aesthetic perception of a user.
Specifically, the specified image is determined based on the coordinates of the sub-image with the highest aesthetic score: the specified coordinates (X, Y, W, H) of the optimal sub-region in the image to be processed are calculated according to the following formula (1), and the region is cut out and output to obtain the specified image.

X = xk × Width / 320,  Y = yk × Height / 320,  W = wk × Width / 320,  H = hk × Height / 320    (1)
In formula (1), Width is the width of the image to be processed after size processing, and Height is its height. The 320 in the formulas for X and W indicates that the width of the third specified size is 320; if the width of the third specified size is another value, the 320 in the formulas for X and W changes accordingly. Likewise, the 320 in the formulas for Y and H indicates that the height of the third specified size is 320, and changes accordingly if the height of the third specified size is another value.
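A small sketch of this coordinate mapping, following formula (1) as reconstructed above; the default of 320 corresponds to the third specified size and would change together with it.

```python
def map_crop_to_original(x_k: float, y_k: float, w_k: float, h_k: float,
                         width: float, height: float, spec: int = 320) -> tuple:
    """Map the optimal candidate region from the spec x spec resized image
    back to (X, Y, W, H) in the size-processed image to be cropped."""
    X = x_k * width / spec
    Y = y_k * height / spec
    W = w_k * width / spec
    H = h_k * height / spec
    return X, Y, W, H
```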
As an implementation manner, when the image to be processed is a preview image corresponding to the camera application of the user terminal and the user uses the user terminal to take a picture, the user terminal may open the camera application and, while the camera application displays the preview interface, control the camera to collect images of other scenes around the scene to be shot, so as to obtain an image covering a larger range around the shot scene. All these images are then combined into one image, for example in the manner of current panoramic image stitching, the combined image is taken as the image to be processed, the above method is executed to obtain the specified image, and the specified image is stored, for example, in an album.
As an embodiment, after the specified image is determined, optimization processing may further be performed on it to obtain a high-definition visual effect. Specifically, display enhancement processing is performed on the specified image, where the display enhancement processing specifically includes optimizing image parameters so as to optimize the image quality. Image quality includes parameters that determine the viewing effect, such as sharpness, lens distortion, color, resolution, color gamut range and purity. Combinations of different parameters can achieve different display enhancement effects, for example a barrel distortion effect centered on the position of a portrait, or a horror atmosphere created by changing the tone of the current picture to gray.
In the practice of the present application, the image parameter optimization includes at least one of exposure enhancement, denoising, edge sharpening, contrast enhancement, or saturation enhancement.
Exposure enhancement is used to increase the brightness of an image. The brightness value can be increased through the histogram of the image, or the brightness can be increased by nonlinear superposition: specifically, let I denote a darker image to be processed and T the relatively bright image after processing; exposure enhancement is then performed as T(x) = I(x) + (1 - I(x)) × I(x), where both T and I are images with values in [0, 1]. If one pass is not sufficient, the algorithm may be iterated multiple times.
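A minimal sketch of this brightness enhancement on an image normalized to [0, 1]; the iteration count is a free parameter the text leaves to judgment.

```python
import numpy as np

def enhance_exposure(image: np.ndarray, iterations: int = 1) -> np.ndarray:
    """Apply T(x) = I(x) + (1 - I(x)) * I(x) one or more times to brighten
    a darker image with pixel values in [0, 1]."""
    t = image.astype(np.float64)
    for _ in range(iterations):
        t = t + (1.0 - t) * t
    return np.clip(t, 0.0, 1.0)
```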
Among them, denoising image data is used to remove noise of images, and in particular, images are often degraded due to interference and influence of various noises during generation and transmission, which adversely affects processing of subsequent images and visual effects of images. Noise is of a wide variety, such as electrical noise, mechanical noise, channel noise and other noise. Therefore, in order to suppress noise, improve image quality, facilitate higher-level processing, it is necessary to perform denoising preprocessing on an image. From the probability distribution of noise, it can be classified into gaussian noise, rayleigh noise, gamma noise, exponential noise, and uniform noise.
Specifically, the image may be denoised with a Gaussian filter, a linear filter that can effectively suppress noise and smooth the image. Its working principle is similar to that of a mean filter: the average value of the pixels in the filter window is taken as the output. The coefficients of the window template, however, differ from those of the mean filter, whose template coefficients are all identical and equal to 1; the template coefficients of the Gaussian filter decrease as the distance from the template center increases. Therefore, the Gaussian filter blurs the image to a smaller degree than the mean filter.
For example, a 5×5 Gaussian filter window is generated and sampled with the center position of the template as the coordinate origin. The coordinates of each position of the template are substituted into the Gaussian function, and the resulting values are the template coefficients. The Gaussian filter window is then convolved with the image to denoise it.
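A minimal sketch of this procedure is given below, assuming a single-channel image and that SciPy is available; the kernel size and sigma are assumptions for illustration.

```python
import numpy as np
from scipy.ndimage import convolve

def gaussian_kernel(size=5, sigma=1.0):
    ax = np.arange(size) - size // 2                   # coordinates relative to the template centre
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()                       # normalized template coefficients

def gaussian_denoise(image, size=5, sigma=1.0):
    # convolve the Gaussian filter window with the image to denoise it
    return convolve(image.astype(np.float32), gaussian_kernel(size, sigma))
```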
Edge sharpening is used to make blurred images sharper. Image sharpening generally has two methods: one is the differential method, and the other is the high-pass filtering method.
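As one possible illustration of the high-pass filtering method, the following sketch subtracts a Laplacian (high-frequency) response from the image; the strength parameter is an assumption for illustration.

```python
import numpy as np
from scipy.ndimage import laplace

def sharpen(image, amount=1.0):
    img = image.astype(np.float32)
    high_pass = laplace(img)                           # high-frequency (edge) component
    return np.clip(img - amount * high_pass, 0, 255).astype(np.uint8)
```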
The contrast enhancement is used to improve the image quality of an image so that the colors in the image are more vivid. Specifically, contrast stretching is one method of image enhancement and also belongs to the gray-scale transformation operations. By stretching the gray values to the whole interval 0-255 through a gray-scale transformation, the contrast is greatly and clearly enhanced. The gray value of a pixel can be mapped to a larger gray space with the following formula:
I(x,y) = [(I(x,y) - Imin) / (Imax - Imin)] × (MAX - MIN) + MIN;
where Imin and Imax are the minimum and maximum gray values of the original image, and MIN and MAX are the minimum and maximum gray values of the gray space to be stretched to.
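A minimal sketch of this stretching is given below, assuming a NumPy gray-scale image; the output interval defaults to the whole 0-255 range.

```python
import numpy as np

def stretch_contrast(image, out_min=0, out_max=255):
    img = image.astype(np.float32)
    i_min, i_max = img.min(), img.max()                # Imin and Imax of the original image
    stretched = (img - i_min) / (i_max - i_min) * (out_max - out_min) + out_min
    return stretched.astype(np.uint8)
```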
As an embodiment, a region to be optimized may be determined based on the specified image, and the above display enhancement processing may be performed on that region. The region to be optimized may be determined based on the content of the specified image: specifically, a target object in the specified image is determined and its type is determined, for example male, female, sky, mountain, river, signboard, or the like; the region of the target object is taken as the region to be optimized, and the region of the target object is then optimized based on the type of the target object, for example with different display enhancement strategies corresponding to different target object types.
As another embodiment, the region to be optimized may be a region determined by the user within the specified image. Specifically, after the user selects the requirement of display enhancement processing, for example by inputting a display enhancement instruction, the specified image is displayed on the screen of the user terminal and the user selects the region to be optimized on the specified image. In particular, the user presses the specified image, and either the region pressed by the user or the region of the target object within the image pressed by the user is taken as the region to be optimized.
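The following sketch illustrates restricting the display enhancement to such a region; the rectangular region representation and the enhancement function passed in are assumptions for illustration.

```python
def enhance_region(image, region, enhance_fn):
    # region: (x, y, w, h) of the region to be optimized within the specified image
    x, y, w, h = region
    result = image.copy()
    result[y:y + h, x:x + w] = enhance_fn(image[y:y + h, x:x + w])
    return result

# e.g. result = enhance_region(specified_image, (40, 60, 200, 150), stretch_contrast)
```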
Therefore, in the embodiments of the present application, the evaluation index of traditional intelligent cropping methods, which is based on objective photography principles, is replaced by a person's subjective aesthetic perception, so that the output result carries subjective aesthetics. In addition, a single-stage cropping flow is designed in which the original image is approximately traversed through dense sampling, without relying on a subject detection algorithm, which makes the method more general.
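For illustration of the dense sampling idea, the following sketch generates equally sized candidate windows at a preset aspect ratio; the window width, stride and ratio are assumptions for illustration rather than parameters fixed by the present application.

```python
def dense_sample(image, ratio=(4, 3), window_w=320, stride=80):
    h, w = image.shape[:2]
    window_h = int(window_w * ratio[1] / ratio[0])     # preset proportion
    subs = []
    for y in range(0, h - window_h + 1, stride):
        for x in range(0, w - window_w + 1, stride):
            subs.append(((x, y), image[y:y + window_h, x:x + window_w]))
    return subs                                        # (top-left coordinate, sub-image) pairs
```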
Referring to fig. 9, a block diagram of an image processing apparatus according to an embodiment of the present application is shown. The apparatus 900 may include an acquisition unit, a determining unit, an evaluation unit, and a processing unit.
The acquisition unit is used for acquiring the image to be processed.
The determining unit is used for cutting the image to be processed according to a preset proportion based on the size parameter of the image to be processed, to obtain a plurality of sub-images.
As an embodiment, all the sub-images together cover the entire area of the image to be processed, and each of the sub-images has the same size.
The evaluation unit is used for determining the aesthetic score of each sub-image through a pre-trained evaluation model.
The processing unit is used for determining a target sub-image from the plurality of sub-images based on the aesthetic score of each sub-image.
Further, the processing unit is further configured to take the sub-image with the highest aesthetic score among the plurality of sub-images as the target sub-image.
Further, the device also comprises an extraction unit for determining coordinates of the target sub-image in the image to be processed; and clipping the image to be processed based on the coordinates of the target sub-image to obtain a clipped image.
Further, the apparatus includes a training unit for acquiring a first set of sample images, each of the first sample images within the first set of sample images corresponding to a first scoring value, the first scoring value being used to characterize the aesthetic appeal of the user; determining a first predicted value of each first sample image through an evaluation model to be trained; training the evaluation model to be trained based on the loss between the first predicted value and the first scoring value of each of the first sample images to obtain a trained evaluation model.
Specifically, the training unit is further used for clipping an original image to obtain a plurality of first sample images, and determining the first scoring value of each first sample image according to a pre-trained aesthetic model to obtain the first sample image set, wherein the aesthetic model is trained based on a second sample image set, and each second sample image in the second sample image set corresponds to a second scoring value, the second scoring value being a labeling result given by a user to the image based on aesthetic cognition.
Specifically, the training unit is further configured to acquire a second sample image set; determining a second predicted value for each of the second sample images by means of the aesthetic model to be trained; the aesthetic model to be trained is trained based on the loss between the second predicted value and a second scoring value for each of the second sample images to obtain a trained aesthetic model.
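A minimal training sketch is given below, assuming PyTorch; the model structure, data loader and hyper-parameters are assumptions for illustration, and the loss follows the description above (the loss between the first predicted value and the first scoring value).

```python
import torch
import torch.nn as nn

def train_evaluation_model(model, loader, epochs=10, lr=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.MSELoss()                           # loss between prediction and scoring value
    for _ in range(epochs):
        for images, first_scores in loader:            # first sample images and first scoring values
            predictions = model(images).squeeze(-1)    # first predicted values
            loss = criterion(predictions, first_scores)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```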
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the apparatus and modules described above may refer to the corresponding process in the foregoing method embodiment, which is not repeated herein.
In several embodiments provided herein, the coupling of the modules to each other may be electrical, mechanical, or other.
In addition, each functional module in each embodiment of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated modules may be implemented in hardware or in software functional modules.
Referring to fig. 10, a block diagram of an electronic device according to an embodiment of the present application is shown. The electronic device 100 may be a smart phone, a tablet computer, an electronic book, or another device capable of running an application program. The electronic device 100 in the present application may be the server described above, or may be the user terminal described above. In particular, the electronic device may include one or more of the following components: a processor 110, a memory 120, and one or more application programs, wherein the one or more application programs may be stored in the memory 120 and configured to be executed by the one or more processors 110, the one or more application programs being configured to perform the method described in the foregoing method embodiments.
The processor 110 may include one or more processing cores. The processor 110 connects the various parts of the electronic device 100 through various interfaces and lines, and performs the various functions of the electronic device 100 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 120 and invoking data stored in the memory 120. Alternatively, the processor 110 may be implemented in at least one hardware form of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 110 may integrate one of, or a combination of several of, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs, and the like; the GPU is responsible for rendering and drawing display content; and the modem is used to handle wireless communication. It will be appreciated that the modem may also not be integrated into the processor 110 and may instead be implemented by a separate communication chip.
The memory 120 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). The memory 120 may be used to store instructions, programs, code, code sets, or instruction sets. The memory 120 may include a program storage area and a data storage area, wherein the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the foregoing method embodiments, and the like. The data storage area may also store data created by the electronic device 100 in use (such as a phonebook, audio and video data, and chat record data).
Referring to fig. 11, a block diagram of a computer readable storage medium according to an embodiment of the present application is shown. The computer readable medium 1100 has stored therein program code that can be invoked by a processor to perform the methods described in the method embodiments above.
The computer readable storage medium 1100 may be an electronic memory such as a flash memory, an EEPROM (Electrically Erasable Programmable Read-Only Memory), an EPROM, a hard disk, or a ROM. Optionally, the computer readable storage medium 1100 includes a non-transitory computer-readable storage medium. The computer readable storage medium 1100 has storage space for program code 1110 for performing any of the method steps described above. The program code can be read from or written to one or more computer program products. The program code 1110 may, for example, be compressed in a suitable form.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will appreciate that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents, and such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (7)

1. An image processing method, comprising:
acquiring an image to be processed, wherein the image to be processed is an image acquired by a camera of a user terminal and displayed in an image preview interface of a camera application of the user terminal under the condition that the image preview interface is displayed on a screen of the user terminal;
cutting the image to be processed according to a preset proportion based on the size parameter of the image to be processed to obtain a plurality of sub-images; determining an aesthetic score of each sub-image through a pre-trained evaluation model, wherein the evaluation model is a model trained through a first sample image set, the first sample image set corresponds to a terminal identifier of the user terminal, the first sample image set comprises a plurality of first sample images, each first sample image corresponds to a first scoring value, the first scoring value is used for representing aesthetic feeling of a target user, and the target user is a user of the user terminal;
determining a target sub-image from a plurality of said sub-images based on the aesthetic score of each said sub-image;
wherein prior to determining the aesthetic score of each of the sub-images by a pre-trained evaluation model, further comprising:
if it is determined that a reference user corresponding to the target user exists and that a first sample image set corresponding to the reference user exists, using the first sample image set of the reference user as the first sample image set corresponding to the target user, wherein the intimacy degree between the target user and the reference user is larger than a specified numerical value;
acquiring a first sample image set of the target user;
determining a first predicted value of each first sample image in a first sample image set of the target user through an evaluation model to be trained;
and training the evaluation model to be trained based on the loss between each first predicted value and each first scoring value to obtain a trained evaluation model.
2. The method according to claim 1, wherein all of the sub-images cover the entire area of the image to be processed.
3. The method of claim 1, further comprising, after determining a target sub-image from a plurality of the sub-images based on the aesthetic score of each of the sub-images:
Determining coordinates of a target sub-image in the image to be processed;
and clipping the image to be processed based on the coordinates of the target sub-image to obtain a clipped image.
4. A method according to any one of claims 1-3, wherein said determining a target sub-image from a plurality of said sub-images based on the aesthetic score of each of said sub-images comprises:
and taking the sub-image with the highest aesthetic score in the plurality of sub-images as a target sub-image.
5. An image processing apparatus, comprising:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring an image to be processed, and the image to be processed is an image acquired by a camera of a user terminal and displayed in an image preview interface of a camera application of the user terminal under the condition that the image preview interface is displayed on a screen of the user terminal;
the determining unit is used for cutting the image to be processed according to a preset proportion based on the size parameter of the image to be processed to obtain a plurality of sub-images;
an evaluation unit for determining an aesthetic score of each sub-image through a pre-trained evaluation model, wherein the evaluation model is a model trained through a first sample image set, the first sample image set corresponds to a terminal identifier of the user terminal, the first sample image set comprises a plurality of first sample images, each first sample image corresponds to a first score value, the first score value is used for representing aesthetic feeling of a target user, and the target user is a user of the user terminal;
a processing unit for determining a target sub-image from a plurality of said sub-images based on the aesthetic score of each of said sub-images;
the device further comprises a training unit, wherein the training unit is used for taking the first sample image set of the reference user as the first sample image set corresponding to the target user if the reference user corresponding to the target user exists and the first sample image set corresponding to the reference user exists, and the intimacy degree of the target user and the reference user is larger than a specified value; acquiring a first sample image set of the target user; determining a first predicted value of each first sample image in a first sample image set of the target user through an evaluation model to be trained; and training the evaluation model to be trained based on the loss between each first predicted value and each first scoring value to obtain a trained evaluation model.
6. An electronic device, comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications configured to perform the method of any of claims 1-4.
7. A computer readable medium, characterized in that the computer readable medium stores a program code executable by a processor, which program code, when executed by the processor, causes the processor to perform the method of any of claims 1-4.
CN202011613934.4A 2020-12-30 2020-12-30 Image processing method, device, electronic equipment and computer readable medium Active CN112839167B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011613934.4A CN112839167B (en) 2020-12-30 2020-12-30 Image processing method, device, electronic equipment and computer readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011613934.4A CN112839167B (en) 2020-12-30 2020-12-30 Image processing method, device, electronic equipment and computer readable medium

Publications (2)

Publication Number Publication Date
CN112839167A CN112839167A (en) 2021-05-25
CN112839167B true CN112839167B (en) 2023-06-30

Family

ID=75923917

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011613934.4A Active CN112839167B (en) 2020-12-30 2020-12-30 Image processing method, device, electronic equipment and computer readable medium

Country Status (1)

Country Link
CN (1) CN112839167B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116168207A (en) * 2021-11-24 2023-05-26 北京字节跳动网络技术有限公司 Image clipping method, model training method, device, electronic equipment and medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110223292A (en) * 2019-06-20 2019-09-10 厦门美图之家科技有限公司 Image evaluation method, device and computer readable storage medium
CN111095293A (en) * 2017-12-15 2020-05-01 华为技术有限公司 Image aesthetic processing method and electronic equipment

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105528757B (en) * 2015-12-08 2019-01-29 华南理工大学 A kind of image aesthetic quality method for improving based on content
US10002415B2 (en) * 2016-04-12 2018-06-19 Adobe Systems Incorporated Utilizing deep learning for rating aesthetics of digital images
CN106650737B (en) * 2016-11-21 2020-02-28 中国科学院自动化研究所 Automatic image cutting method
CN107146198B (en) * 2017-04-19 2022-08-16 中国电子科技集团公司电子科学研究院 Intelligent photo cutting method and device
US10489688B2 (en) * 2017-07-24 2019-11-26 Adobe Inc. Personalized digital image aesthetics in a digital medium environment
CN107610123A (en) * 2017-10-11 2018-01-19 中共中央办公厅电子科技学院 A kind of image aesthetic quality evaluation method based on depth convolutional neural networks
CN110610479B (en) * 2019-07-31 2024-05-03 花瓣云科技有限公司 Object scoring method and device
CN110796663B (en) * 2019-09-17 2022-12-02 北京迈格威科技有限公司 Picture clipping method, device, equipment and storage medium
CN110633377A (en) * 2019-09-23 2019-12-31 三星电子(中国)研发中心 Picture cleaning method and device
CN110807139B (en) * 2019-10-23 2023-09-01 腾讯科技(深圳)有限公司 Picture identification method, device, computer readable storage medium and computer equipment
CN110796650A (en) * 2019-10-29 2020-02-14 杭州阜博科技有限公司 Image quality evaluation method and device, electronic equipment and storage medium
CN111199541A (en) * 2019-12-27 2020-05-26 Oppo广东移动通信有限公司 Image quality evaluation method, image quality evaluation device, electronic device, and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111095293A (en) * 2017-12-15 2020-05-01 华为技术有限公司 Image aesthetic processing method and electronic equipment
CN110223292A (en) * 2019-06-20 2019-09-10 厦门美图之家科技有限公司 Image evaluation method, device and computer readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhang Rongjie. "Research on Aesthetic Image Generation Based on Generative Adversarial Networks". China Master's Theses Full-text Database, 2020, full text. *

Also Published As

Publication number Publication date
CN112839167A (en) 2021-05-25

Similar Documents

Publication Publication Date Title
Li et al. Low-light image and video enhancement using deep learning: A survey
CN106778928B (en) Image processing method and device
AU2017261537B2 (en) Automated selection of keeper images from a burst photo captured set
JP7413400B2 (en) Skin quality measurement method, skin quality classification method, skin quality measurement device, electronic equipment and storage medium
US9639956B2 (en) Image adjustment using texture mask
EP3477931A1 (en) Image processing method and device, readable storage medium and electronic device
WO2020125631A1 (en) Video compression method and apparatus, and computer-readable storage medium
CN111967319B (en) Living body detection method, device, equipment and storage medium based on infrared and visible light
CN116681636B (en) Light infrared and visible light image fusion method based on convolutional neural network
JP2004310475A (en) Image processor, cellular phone for performing image processing, and image processing program
CN108647696B (en) Picture color value determining method and device, electronic equipment and storage medium
Lv et al. Low-light image enhancement via deep Retinex decomposition and bilateral learning
CN113658197B (en) Image processing method, device, electronic equipment and computer readable storage medium
CN111882555A (en) Net detection method, device, equipment and storage medium based on deep learning
CN108711160A (en) A kind of Target Segmentation method based on HSI enhancement models
CN112839167B (en) Image processing method, device, electronic equipment and computer readable medium
JP2004240622A (en) Image processing method, image processor and image processing program
Bugeau et al. Influence of color spaces for deep learning image colorization
CN112668567A (en) Image clipping algorithm based on deep learning
JP5896204B2 (en) Image processing apparatus and program
CN111797694A (en) License plate detection method and device
CN111179287A (en) Portrait instance segmentation method, device, equipment and storage medium
US20220398704A1 (en) Intelligent Portrait Photography Enhancement System
US10026201B2 (en) Image classifying method and image displaying method
CN113552944B (en) Wisdom propaganda system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant