CN113516666A - Image cropping method and device, computer equipment and storage medium


Info

Publication number
CN113516666A
Authority
CN
China
Prior art keywords
image
target image
target
significance
saliency
Prior art date
Legal status
Pending
Application number
CN202011644040.1A
Other languages
Chinese (zh)
Inventor
张考
李松南
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Co., Ltd.
Priority to CN202011644040.1A
Publication of CN113516666A
Legal status: Pending

Classifications

    • G06T7/11 (Image analysis; segmentation; region-based segmentation)
    • G06T7/136 (Image analysis; segmentation or edge detection involving thresholding)
    • G06T7/73 (Determining position or orientation of objects or cameras using feature-based methods)
    • G06T2207/10016 (Image acquisition modality: video; image sequence)
    • G06T2207/20132 (Special algorithmic details: image cropping)

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Processing (AREA)

Abstract

The embodiment of the application discloses an image cropping method and device, computer equipment and a storage medium. The method comprises the following steps: acquiring a target image to be processed, and determining a crop box for cropping the target image; performing visual saliency prediction on the target image to obtain saliency information of the target image; determining a target position of the crop box in the target image according to the saliency information; and cropping the target image with the crop box at the target position to obtain a cropped image. By the method and the device, the cropped image can include the predicted salient region as far as possible, so that the cropped image better attracts the user's attention and improves user stickiness.

Description

Image cropping method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to an image cropping method, an image cropping apparatus, a computer device, and a computer-readable storage medium.
Background
With the continuous development of computer technology, a large number of images appear on the network. In the case where the required image size does not match the original image size (for example, when the original image is displayed and the size of the display area does not match the original image size), it is necessary to perform cropping processing on the original image. Common cropping methods include: cropping the original image according to an area indicated by the current user to obtain the required image; alternatively, cropping the original image in a static cropping mode (i.e., cropping at a fixed, predetermined position) to obtain the required image. Practice shows that the mode of manually specifying the cropping area has low efficiency and high labor cost, while the static cropping mode fixes the cropping position and has poor flexibility. Based on this, how to better realize image cropping has become a research hotspot.
Disclosure of Invention
The embodiment of the invention provides an image cropping method, an image cropping apparatus, a computer device and a storage medium, which can realize flexible cropping of a target image, so that the cropped image can better attract the attention of a user and improve user stickiness.
In one aspect, an embodiment of the present application provides an image cropping method, including:
acquiring a target image to be processed, and determining a cutting frame for cutting the target image;
performing visual saliency prediction on the target image to obtain saliency information of the target image, wherein the saliency information is used for indicating the distribution condition of a saliency area in the target image, and the saliency area is an area capable of attracting the attention of a user in the target image;
determining a target position of the cutting frame in the target image according to the significance information, wherein the target position refers to: when the attribute of the saliency region included in the crop box meets the attribute condition, the position of the crop box in the target image;
and cutting the target image by adopting the cutting frame at the target position to obtain a cut image.
In another aspect, the present application provides an image cropping device, including:
the device comprises an acquisition unit and a processing unit, wherein the acquisition unit is used for acquiring a target image to be processed and determining a cutting frame for cutting the target image;
the processing unit is used for carrying out visual saliency prediction on the target image to obtain saliency information of the target image, wherein the saliency information is used for indicating the distribution situation of a saliency area in the target image, and the saliency area refers to an area which can attract the attention of a user in the target image;
the processing unit is further configured to determine, according to the saliency information, a target position of the crop box in the target image, where the target position is: when the attribute of the saliency region included in the crop box meets the attribute condition, the position of the crop box in the target image;
the processing unit is further configured to perform cropping processing on the target image at the target position by using the cropping frame, so as to obtain a cropped image.
In an embodiment, the processing unit is configured to determine, according to the saliency information, a target position of the crop box in the target image, and specifically is configured to:
determining a sliding direction of the cutting frame in the target image;
sliding the cutting frame in the target image according to the sliding direction to determine a plurality of candidate positions of the cutting frame in the target image, wherein each candidate position refers to a position of the cutting frame in the target image after each sliding;
calculating the attribute of a saliency region included by the cutting frame at each candidate position according to the saliency information;
and selecting a candidate position corresponding to the target attribute meeting the attribute condition from the plurality of candidate positions as the target position of the cutting frame in the target image.
In one embodiment, the processing unit is configured to determine a sliding direction of the crop box in the target image, and specifically is configured to:
acquiring the width-height ratio of the target image and the width-height ratio of the cutting frame;
if the width-height ratio of the target image is larger than the width-height ratio of the cutting frame, determining the sliding direction of the cutting frame in the target image as a horizontal sliding direction;
and if the width-height ratio of the target image is smaller than the width-height ratio of the cutting frame, determining the sliding direction of the cutting frame in the target image as a vertical sliding direction.
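For illustration only, the following is a minimal Python sketch of this aspect-ratio comparison; the function name, argument names and the handling of equal ratios are assumptions rather than part of the disclosure:

```python
def sliding_direction(img_w, img_h, box_w, box_h):
    """Decide how the crop box slides inside the target image.

    Returns 'horizontal' when the image is wider (relative to its height)
    than the crop box, 'vertical' when it is narrower. Names and the
    tie-breaking choice are illustrative assumptions, not the patent text.
    """
    img_ratio = img_w / img_h
    box_ratio = box_w / box_h
    if img_ratio > box_ratio:
        return "horizontal"   # crop box spans the full height, slides left-right
    elif img_ratio < box_ratio:
        return "vertical"     # crop box spans the full width, slides up-down
    return None               # ratios match: no sliding is needed
```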
In one embodiment, the saliency information includes saliency probability values of respective pixel points in the target image; the attribute of the saliency area included by the crop box at any candidate position includes: the saliency calculated according to the saliency probability values of the pixel points included by the crop box at that candidate position;
the attribute conditions include: a condition of significance greater than a significance threshold, or a condition of maximum significance.
In an embodiment, the processing unit is configured to calculate an attribute of a salient region included in the crop box at each candidate position, and specifically is configured to:
according to the sliding direction, projecting the significance probability value of each pixel point of the target image into the target image to obtain a projection curve;
for any candidate position, determining a curve segment included by the cutting frame at any candidate position from the projection curve; and integrating the curve segment to obtain the significance of the significance region included by the cutting frame at any candidate position.
In one embodiment, the target image includes P rows × Q columns of pixel points, and values of P and Q are positive integers; the sliding direction comprises a horizontal sliding direction or a vertical sliding direction; the processing unit is configured to project, according to the sliding direction, the saliency probability values corresponding to the pixel points of the target image into the target image to obtain a projection curve, and is specifically configured to:
if the sliding direction is the horizontal sliding direction, sequentially obtaining the sum of the saliency probability values of all pixel points in the q-th column of the target image, and using the sum as the projection point of the q-th column, wherein q ∈ [1, Q]; and drawing a curve with the projection points of the columns of the target image to obtain the projection curve;
if the sliding direction is the vertical sliding direction, sequentially obtaining the sum of the saliency probability values of all pixel points in the p-th row of the target image, and using the sum as the projection point of the p-th row, wherein p ∈ [1, P]; and drawing a curve with the projection points of the rows of the target image to obtain the projection curve.
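For illustration only, the following Python/NumPy sketch scores candidate positions by this projection-and-integration approach; the function and argument names and the use of prefix sums are implementation assumptions:

```python
import numpy as np

def candidate_saliency(saliency_map, box_len, offsets, direction="horizontal"):
    """Score each candidate position of the crop box from a saliency map.

    saliency_map: H x W array of per-pixel saliency probabilities.
    box_len:      width (horizontal sliding) or height (vertical sliding)
                  of the crop box in pixels.
    offsets:      candidate start coordinates along the sliding direction.
    The projection curve is the column-wise (or row-wise) sum of the map;
    "integrating" a curve segment reduces to summing the projection over
    the columns (or rows) covered by the crop box.
    """
    axis = 0 if direction == "horizontal" else 1   # sum over rows -> per-column projection
    projection = saliency_map.sum(axis=axis)
    # Prefix sums make every windowed "integral" an O(1) lookup.
    prefix = np.concatenate([[0.0], np.cumsum(projection)])
    return {off: prefix[off + box_len] - prefix[off] for off in offsets}
```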
In one embodiment, after obtaining the cropped image, the processing unit is further configured to:
calculating the significance of the target image according to the significance probability value of each pixel point in the target image;
according to the significance of the target image and the significance of the cut image, carrying out integrity scoring on the cut image and outputting a scoring result;
wherein the saliency of the cropped image is equal to the saliency of the saliency area included by the crop box at the target position.
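The exact scoring formula is not fixed by this embodiment; the sketch below assumes a simple retained-saliency ratio as the integrity (completeness) score:

```python
def completeness_score(saliency_map, crop_slice):
    """Score how much of the image's total saliency the crop retains.

    crop_slice is a (row_slice, col_slice) pair describing the crop box at
    the target position; a retained/total ratio is assumed here because the
    embodiment only states that both saliencies enter the score.
    """
    total = float(saliency_map.sum())
    kept = float(saliency_map[crop_slice].sum())
    return kept / total if total > 0 else 0.0
```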
In one embodiment, the target image is any frame image in a target video segment; the processing unit is configured to perform clipping processing on the target image at the target position by using the clipping frame to obtain a clipped image, and specifically configured to:
determining target position coordinates of the target position and reference position coordinates of a reference position of the cutting frame in each reference image; the reference image refers to an image in the target video clip except the target image;
calibrating the target position coordinates according to the reference position coordinates to obtain calibrated position coordinates;
and at the position indicated by the calibrated position coordinate, cutting the target image by adopting the cutting frame to obtain a cut image.
In an embodiment, the processing unit is configured to perform calibration processing on the target position coordinates according to each reference position coordinate to obtain calibrated position coordinates, and specifically configured to:
if the number of the images in the target video clip is smaller than a number threshold, performing mean value operation on each reference position coordinate and the target position coordinate to obtain a calibrated position coordinate;
and if the number of the images in the target video clip is larger than the number threshold, performing one-dimensional Gaussian smoothing on each reference position coordinate and the target position coordinate, and determining the calibrated position coordinate according to the smoothing result.
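For illustration, a minimal Python sketch of these two calibration strategies is given below, using SciPy's one-dimensional Gaussian filter; the count threshold, the sigma value and the representation of positions as a one-dimensional track along the sliding direction are assumptions, not part of the disclosure:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def calibrate_position(frame_positions, target_index, count_threshold=10, sigma=2.0):
    """Calibrate the crop-box coordinate of one frame using its neighbours.

    frame_positions: crop-box coordinate along the sliding direction,
                     one entry per frame of the video segment.
    target_index:    index of the frame currently being cropped.
    count_threshold and sigma are illustrative values only.
    """
    positions = np.asarray(frame_positions, dtype=float)
    if len(positions) < count_threshold:
        # Short clip: average all positions (reference frames + target frame).
        return positions.mean()
    # Longer clip: smooth the position track and read off the target frame.
    smoothed = gaussian_filter1d(positions, sigma=sigma)
    return smoothed[target_index]
```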
In one embodiment, the smoothing result comprises: the smoothed coordinates of the reference positions and the smoothed coordinates of the target position coordinates are obtained; the processing unit is configured to determine the calibrated position coordinate according to the smoothing result, and specifically configured to:
detecting whether a target image group exists in the target video clip according to the smoothing processing result, wherein the target image group comprises the target image, and the coordinate difference value between the smooth coordinate of the cutting frame in any image in the target image group and the smooth coordinate of the cutting frame in the first frame image in the target image group is smaller than or equal to a difference threshold value;
if so, performing mean value operation on the smooth position coordinates of the cutting frame in each image in the target image group to obtain calibrated position coordinates;
and if no such target image group exists, taking the smoothed coordinate obtained by smoothing the target position coordinate as the calibrated position coordinate.
In an embodiment, the processing unit is configured to acquire a target image to be processed, and specifically configured to:
acquiring an initial image;
detecting an invalid edge of the initial image, wherein the invalid edge is an edge formed by filling pixels in the image;
if one or more invalid edges exist in the initial image, deleting the one or more invalid edges in the initial image to obtain a target image;
and if no invalid edge exists in the initial image, taking the initial image as the target image to be processed.
In an embodiment, the processing unit is configured to perform invalid edge detection on the initial image, and specifically, to:
determining M detection directions of the initial image, and acquiring a reference value corresponding to the mth detection direction; wherein M is a positive integer, and M belongs to [1, M ];
sequentially scanning each pixel group in the initial image according to the mth detection direction, wherein one pixel group consists of all pixel points which are positioned in the same row or the same column in the initial image;
counting the number of target pixel points in the currently scanned pixel group according to the reference value, wherein the target pixel points refer to: pixel points of which the difference value between the pixel value and the reference value is greater than a difference threshold value;
if the number of the target pixel points meets the number condition, continuing to scan and adding one to the invalid count corresponding to the mth detection direction; otherwise, determining the pixel group scanned currently as a marked pixel group, terminating scanning in the mth detection direction, and acquiring the numerical value of the invalid count when the scanning is terminated in the mth detection direction;
if the value is greater than a first threshold, determining that an invalid edge exists in the mth detection direction of the initial image, and the invalid edge includes: each pixel group of the initial image located before the marker pixel group in the m-th detection direction; otherwise, judging that the initial image has no invalid edge in the mth detection direction.
In an embodiment, if the value is greater than the first threshold, the processing unit is configured to determine that an invalid edge exists in the mth detection direction in the initial image, and specifically is configured to:
if the numerical value is larger than the first threshold value, judging whether the numerical value is larger than a second threshold value, wherein the second threshold value is larger than the first threshold value;
if the value is greater than or equal to the second threshold value, determining that no invalid edge exists in the mth detection direction of the initial image;
and if the numerical value is smaller than the second threshold value, judging that the initial image has an invalid edge in the mth detection direction.
In one embodiment, the target image is an image in a target video, the target video includes N frames of images, N is a positive integer; the processing unit is configured to perform visual saliency prediction on the target image to obtain saliency information of the target image, and specifically, is configured to:
acquiring a significance prediction model, wherein the significance prediction model comprises a time flow network and a space flow network;
acquiring one or more frames of associated images of the target image, wherein each frame of associated image and the target image form a continuous image sequence;
calling the time flow network to carry out significance prediction on the target image according to the relevance between each frame of associated image and the target image to obtain a time sequence significance result of the target image;
calling the spatial stream network to carry out significance prediction on the target image to obtain a spatial significance result of the target image;
and fusing the time sequence significance result and the space significance result to obtain significance information of the target image.
In one embodiment, the significance prediction model further comprises a convolutional Gaussian layer, wherein the convolutional Gaussian layer is obtained by training based on a plurality of Gaussian kernels with different variances; the processing unit is configured to fuse the time-series saliency result and the spatial saliency result to obtain saliency information of the target image, and specifically configured to:
fusing the time sequence significance result and the space significance result to obtain a fusion result;
and calling the convolution Gaussian layer to carry out calibration processing on the fusion result to obtain the significance information of the target image.
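As an illustration of the data flow in this embodiment, the following Python sketch fuses the two results with a simple weighted sum and calibrates the fusion with a fixed Gaussian filter; the real model uses a learned convolutional Gaussian layer, so the alpha and sigma values here are assumptions:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def fuse_saliency(temporal_map, spatial_map, alpha=0.5, sigma=3.0):
    """Fuse the temporal-stream and spatial-stream saliency predictions.

    alpha and sigma are placeholders: the patent fuses the two results and
    calibrates them with a *learned* convolutional Gaussian layer trained
    from Gaussian kernels of different variances; a fixed Gaussian blur is
    used here only to show the order of operations.
    """
    fused = alpha * temporal_map + (1.0 - alpha) * spatial_map
    calibrated = gaussian_filter(fused, sigma=sigma)
    # Normalise back to [0, 1] so values read as saliency probabilities.
    rng = calibrated.max() - calibrated.min()
    return (calibrated - calibrated.min()) / rng if rng > 0 else calibrated
```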
Accordingly, the present application provides a computer device comprising a processor, a memory and a communication interface, the processor, the memory and the communication interface being connected to each other, wherein the memory is used for storing a computer program, the computer program comprises program instructions, and the processor is configured to call the program instructions to execute the image cropping method described above.
Accordingly, the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the image cropping method described above.
Accordingly, the present application provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to execute the image cropping method.
According to the method and the device, after the target image to be processed is obtained, saliency information used for indicating the distribution of salient regions in the target image can be obtained by performing visual saliency prediction on the target image, and the target position of the crop box in the target image can be flexibly determined according to the saliency information. Then, the target image can be cropped with the crop box at the target position, so that the cropped image includes the predicted salient region as far as possible, better attracts the user's attention, and improves user stickiness.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1a is a schematic diagram illustrating a horizontal screen to vertical screen cutting according to an embodiment of the present application;
fig. 1b is a schematic diagram illustrating a vertical screen to horizontal screen cutting according to an embodiment of the present disclosure;
fig. 2 is a flowchart of an image cropping method according to an embodiment of the present application;
FIG. 3 is a flowchart of another image cropping method provided in the embodiments of the present application;
FIG. 4a is a schematic diagram of an image with invalid edges according to an embodiment of the present disclosure;
fig. 4b is a schematic diagram of deleting an invalid edge in an image according to an embodiment of the present disclosure;
FIG. 4c is a diagram of a model architecture of a significance prediction model provided in an embodiment of the present application;
fig. 4d is a schematic diagram illustrating a saliency of a saliency region calculated according to saliency information according to an embodiment of the present application;
FIG. 4e is a diagram illustrating a plurality of results that may be output by a computing device according to an embodiment of the present disclosure;
fig. 4f is a schematic diagram of a clipping process of an original image according to an embodiment of the present disclosure;
FIG. 4g is a schematic diagram of a trimming step according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of an image cropping device according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
With the continuous development of internet technology, Artificial Intelligence (AI) technology has also developed. Artificial intelligence refers to the theories, methods, technologies and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best result. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and to produce an intelligent machine able to react in a manner similar to human intelligence, so that the machine has functions such as perception, reasoning and decision-making. Accordingly, AI technology is a comprehensive discipline that mainly includes Computer Vision (CV) technology, speech processing technology, natural language processing technology, and Machine Learning (ML)/deep learning.
Computer vision is the science of how to make machines "see"; specifically, it uses cameras and computers instead of human eyes to identify, track and measure targets and to perform further graphic processing, so that the processed image is more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision research attempts to build artificial intelligence systems that can capture information from images or multidimensional data, and typically includes techniques such as image processing, video semantic understanding, and video content/behavior recognition. Machine learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specially studies how a computer simulates or realizes human learning behavior so as to acquire new knowledge or skills and reorganize the existing knowledge structure to continuously improve its performance. Machine learning is the core of AI, the fundamental approach to making computers intelligent, and is applied across various areas of artificial intelligence. Machine learning/deep learning generally includes techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and formal education learning.
Based on a computer vision technology and a machine learning technology in an AI technology, the embodiment of the application provides an image cropping scheme to better crop an independent image or each frame of image in a video. The image cropping scheme may be executed by a computer device, where the computer device may be a terminal or server having image processing capabilities. Among others, the terminal may include but is not limited to: smart phones (such as Android phones, IOS phones, and the like), tablet computers, portable personal computers, mobile internet devices (MID for short), and the like, which are not limited in the embodiments of the present application. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a CDN (Content Delivery Network), a big data and an artificial intelligence platform, which is not limited in the embodiment of the present application.
In a specific implementation, the general principle of the image cropping scheme is as follows: firstly, a saliency prediction model can be obtained by training based on machine learning technology, the saliency prediction model being a model with visual saliency prediction capability; visual saliency prediction means simulating human visual attention through an algorithm and extracting the salient regions (namely, the regions a human is interested in) from an image/video. When there is an image cropping requirement for a certain image, the saliency prediction model can predict the salient regions in the target image, so that the position of the crop box in the target image can be flexibly determined according to the prediction result; and the target image is cropped with the crop box at the determined position, so that the cropped image includes the salient regions predicted by the model as far as possible, better attracts the user's attention, and improves user stickiness.
In practical application, the image clipping scheme can be applied to various image clipping scenes according to actual requirements, such as a clipping scene of a horizontal screen to a vertical screen, a clipping scene of a vertical screen to a horizontal screen, and the like. Wherein, the horizontal screen refers to a screen with a width-height ratio larger than 1 (such as a screen of 16: 9), and the vertical screen refers to a screen with a width-height ratio smaller than 1 (such as a screen of 9: 16); the aspect ratio herein refers to a ratio between the width and the height, and the width refers to a length in the horizontal direction and the height refers to a length in the vertical direction. Correspondingly, the horizontal screen is rotated to the vertical screen, which means that: converting a landscape screen image (or landscape screen video) displayed on a landscape screen into a portrait screen image (or portrait screen video) displayed on a portrait screen through image cropping; the rotation of the vertical screen to the horizontal screen means that: the vertical screen image (or vertical screen video) displayed on the vertical screen is converted into the horizontal screen image (or horizontal screen video) displayed on the horizontal screen through image cropping. And aiming at any image cutting scene, the computer equipment can call the image cutting scheme to cut the target image based on different width-height ratios. For example, in a landscape to portrait cropping scene: for a target image with a width-to-height ratio of 16:9, the computer device may employ the image cropping scheme to crop the target image into a cropped image with a width-to-height ratio of 9:16, 1:1, 4: 3; and each cut image can accurately and completely reserve the content concerned by the user in the target image, as shown in fig. 1 a. For another example, in a clipping scene of turning a vertical screen into a horizontal screen: for a target image with a width to height ratio of 9:16, the computer device may employ the image cropping scheme to crop the target image to a width to height ratio of 16:9, as shown in FIG. 1 b.
Therefore, the image cropping scheme provided by the embodiment of the application can be applied to width-to-height ratio conversion of images or videos and to video material editing in various forms. On one hand, for existing video resources such as news, variety shows, movies and TV series, online education videos, advertisements, sports, performances and animations, the scheme can perform width-to-height ratio conversion of images or videos for different devices (according to the size of the display screen of the display device). For example, old film and television works such as TV interviews are usually in 4:3 and can be cropped according to different devices, such as 3:2 for a tablet device and 9:16 (vertical screen) or 16:9 (horizontal screen) for a mobile phone. On the other hand, the scheme can carry out video editing according to different applications/requirements. Specifically, for live-streaming users, Vlog users, social media workers, advertisement video editing and the like, the image cropping scheme provided by the embodiment of the application can automatically crop original images or video materials into images or videos with different width-to-height ratios according to the aspect ratio format requirements of different platforms; for example, the aspect ratio format requirement of images or videos of an XXX platform is 16:9, that of a YYYY space is 9:16, that of a ZZZ webpage is 1:1, and the like.
Based on the above description of the image cropping scheme, an embodiment of the present invention proposes an image cropping method, which is executable by the above mentioned computer device; referring to fig. 2, the image cropping method may include the following steps S201 to S204:
s201, acquiring a target image to be processed, and determining a cutting frame for cutting the target image.
The target image may be a single image (e.g., a photograph), or any frame image in the target video segment, and the target image to be processed may also be an image obtained by performing preprocessing (e.g., resolution conversion processing, invalid edge detection processing, etc.) on the original image, which is not limited herein. When determining a cutting frame for cutting the target image, the computer device acquires a cutting proportion designated by a user and determines the cutting proportion as the width-height proportion of the cutting frame. And if the width-height ratio of the cutting frame is smaller than the width-height ratio of the target image, determining the height of the cutting frame according to the height of the target image, enabling the height of the image in the cutting frame to be consistent with the height of the target image, and calculating the width of the cutting frame according to the height of the cutting frame and the width-height ratio to determine the cutting frame. Similarly, if the width-height ratio of the cutting frame is larger than that of the target image, the width of the cutting frame is determined according to the width of the target image, so that the width of the image in the cutting frame is consistent with that of the target image, and the height of the cutting frame is calculated according to the width of the cutting frame and the width-height ratio, so that the cutting frame is determined.
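For illustration, a minimal Python sketch of this crop-box sizing rule is given below; the function name and the rounding to whole pixels are assumptions:

```python
def crop_box_size(img_w, img_h, crop_ratio):
    """Size the crop box for a user-specified width/height ratio.

    If the box is "narrower" than the image (smaller ratio) it keeps the full
    image height; otherwise it keeps the full image width, exactly as the
    step above describes. Rounding to whole pixels is an assumption.
    """
    if crop_ratio < img_w / img_h:
        box_h = img_h
        box_w = int(round(box_h * crop_ratio))
    else:
        box_w = img_w
        box_h = int(round(box_w / crop_ratio))
    return box_w, box_h
```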
S202, performing visual saliency prediction on the target image to obtain saliency information of the target image.
The saliency information is used for indicating the distribution of saliency areas in the target image, and the saliency areas refer to areas capable of attracting the attention of the user in the target image. For example, in an image a, the user usually focuses more on the content of the central area of the image than on the content of the periphery of the image; then, the salient region in the image a may be an image center region. For another example, in an image B relating to the display of an item, the user is generally more concerned about the item displayed in the image B than the presenter who displays the item; then the salient region in this image B may be the display region of the article, etc.
In a specific implementation, the saliency information of the target image may include a saliency probability value of each pixel point in the target image; the saliency probability value of any pixel point is used for indicating the probability that the user pays attention to that pixel point when viewing the target image (namely, the probability that the pixel point can attract the user's attention). It can be understood that the user's attention to a pixel point is in direct proportion to the saliency probability value of the pixel point; also, one or more salient regions may be included in the target image. In one embodiment, when the distribution of the salient regions in the target image is indicated by the saliency information, the computer device may further generate a heat map of the target image according to the saliency information, so that the distribution of the salient regions in the target image is intuitively reflected by the heat map. The heat map refers to an image obtained by highlighting each pixel point in the target image with a different display color according to its saliency probability value; the depth of the display color of any pixel point in the heat map is in direct proportion to the saliency probability value of that pixel point. Specifically, the computer device may sequentially traverse each pixel point in the target image, determine a display color for highlighting the currently traversed pixel point according to its saliency probability value, and highlight the currently traversed pixel point in the target image with that display color; after all pixel points have been traversed, the heat map of the target image can be obtained. After the heat map of the target image is output, the user may regard the more strongly colored areas in the heat map as salient regions.
Optionally, one saliency region in the target image may be a region formed by a plurality of continuous pixels in the target image, where the saliency probability value is greater than the saliency threshold. In another embodiment, the computer device may also divide the target image into a plurality of regions, calculate a mean value of the significance probability values of all the pixel points in each region, and use the region with the calculated mean value larger than the region determination threshold as the significance region; that is, in this embodiment, one salient region in the target image refers to: and when the mean value of the significance probability values of all the pixel points is larger than the region judgment threshold value, the region to which each pixel point belongs.
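For illustration, the following Python sketch renders such a heat map with OpenCV; the colormap and the blending weight are assumptions, since the embodiment only requires that deeper colors correspond to higher saliency probability values:

```python
import numpy as np
import cv2

def saliency_heatmap(image_bgr, saliency_map, alpha=0.5):
    """Overlay a heat map of per-pixel saliency probabilities on the image.

    image_bgr is an 8-bit BGR image and saliency_map holds values in [0, 1];
    warmer colours mark pixels with a higher probability of attracting
    attention. The colormap and blending weight are illustrative choices.
    """
    gray = np.clip(saliency_map * 255.0, 0, 255).astype(np.uint8)
    colored = cv2.applyColorMap(gray, cv2.COLORMAP_JET)     # pseudo-colour rendering
    return cv2.addWeighted(image_bgr, 1.0 - alpha, colored, alpha, 0.0)
```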
And S203, determining the target position of the cutting frame in the target image according to the significance information.
Wherein, the target position refers to: and when the attribute of the saliency region included in the crop box satisfies the attribute condition, the position of the crop box in the target image. Attributes of the salient region included within the crop box may include any of the following: the number of salient regions included within the crop box, the area of salient regions included within the crop box, the saliency of salient regions included within the crop box, and so forth. And calculating the significance of the significance region included in the cutting box according to the significance probability value of each pixel point included in the cutting box. Accordingly, the attributes of the saliency areas when included within the crop box include: in clipping the number of salient regions included in the box, the attribute conditions may include: the condition that the number is maximum or the condition that the number is greater than a preset threshold value; when the attributes of the salient region included within the crop box include: when clipping the area of a salient region included in a frame, the attribute condition may include: a condition of a maximum area, or a condition of an area greater than an area threshold; when the attributes of the salient region included within the crop box include: when the saliency of a saliency region included within a crop box is significant, the attribute conditions may include: a condition of maximum significance, or a condition of significance greater than a significance threshold.
In the specific implementation process of step S203, the computer device may slide the cropping frame in the target image based on the sliding step length according to the sliding direction of the cropping frame, and determine the position of the cropping frame after each sliding as a candidate position of the target image until the entire target image is traversed, so as to obtain a candidate position set of the cropping frame in the target image. Secondly, the computer equipment can calculate the attribute of the saliency region included by the cutting frame at each candidate position according to the saliency information; and then, selecting a candidate position corresponding to the target attribute meeting the attribute condition from the plurality of candidate positions as a target position of the cutting frame in the target image. The sliding direction of the cutting frame comprises a horizontal sliding direction or a vertical sliding direction; the sliding step length of the cutting frame is used for indicating the sliding distance between the position after sliding at each time and the last position, the specific value can be set according to an empirical value, the sliding step length can be in units of pixel points, and when the sliding step length is 5, the sliding distance between the position after sliding of the cutting frame and the last position is 5 pixel points. It will be appreciated that the smaller the distance per slide (i.e., the shorter the slide step size), the more candidate positions of the crop box are included in the target image, and thus the slide step size may also be determined according to influencing factors such as the size of the target image, the performance of the computer device, etc. It should also be noted that, in other embodiments, the computer device may determine the candidate position of the crop box in the target image directly from the target image by recognizing the image content of the target image according to an empirical value or through a model instead of determining the candidate position by sliding the crop box.
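For illustration, a minimal Python sketch of enumerating candidate positions with a fixed sliding step is given below; the function name and the clamping of the final offset are assumptions:

```python
def candidate_offsets(img_len, box_len, step=5):
    """Enumerate candidate start offsets of the crop box along the sliding axis.

    img_len / box_len are the image and crop-box extents along the sliding
    direction (width for horizontal sliding, height for vertical sliding).
    step is the sliding step length in pixels (5 follows the example above).
    """
    last = max(img_len - box_len, 0)
    offsets = list(range(0, last + 1, step))
    if offsets[-1] != last:
        offsets.append(last)   # make sure the far edge of the image is also covered
    return offsets
```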
And S204, cutting the target image by adopting the cutting frame at the target position to obtain a cut image.
In one embodiment, after determining the target position of the crop box, the computer device may directly move the crop box to the target position in the target image according to the target position coordinates of the target position, and cut out the image content included in the crop box at the target position from the target image to obtain the cut image. In another embodiment, if the target image is any frame image in the target video segment, the computer device may further perform calibration processing on the position coordinates of the target position according to the position coordinates of the cropping frame in other images in the target video segment, so as to move the cropping frame to the position indicated by the calibrated position coordinates, and perform clipping processing on the image content included in the cropping frame at this time (i.e., clipping the image content included in the cropping frame from the target image) to obtain the cropped image. After obtaining the cropped image, the computer device may zoom the cropped image according to the size of the display screen, thereby outputting the zoomed image (e.g., the computer device directly displays the zoomed image, or sends the zoomed image to the display device for display).
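For illustration, a minimal Python sketch of this final crop-and-scale step is given below; the coordinate convention and the use of OpenCV resizing are assumptions:

```python
import cv2

def crop_and_scale(image, x, y, box_w, box_h, out_w, out_h):
    """Cut out the crop box at its (calibrated) position and scale it.

    (x, y) is the top-left corner of the crop box in pixel coordinates and
    (out_w, out_h) is the size of the display area; both are assumptions
    about how the coordinates produced above are represented.
    """
    cropped = image[y:y + box_h, x:x + box_w]
    return cv2.resize(cropped, (out_w, out_h), interpolation=cv2.INTER_AREA)
```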
According to the method and the device, after the target image to be processed is obtained, saliency information used for indicating the distribution of salient regions in the target image can be obtained by performing visual saliency prediction on the target image, and the target position of the crop box in the target image can be flexibly determined according to the saliency information. Then, the target image can be cropped with the crop box at the target position, so that the cropped image includes the predicted salient region as far as possible, better attracts the user's attention, and improves user stickiness.
Based on the above description of the image cropping scheme, the embodiment of the present invention proposes another image cropping method, which can be executed by the above mentioned computer device; referring to fig. 3, the image cropping method may include the following steps S301 to S310:
s301, acquiring an initial image, and detecting invalid edges of the initial image.
The initial image may be a single image or any frame image in the initial video. The invalid edge is an edge formed by filling pixels into an image, and the colors of the pixels used for filling are the same color (such as black and white); when the color of the pixel point used for filling is black, the invalid edge can be called as a black edge, and when the color of the pixel point used for filling is white, the invalid edge can be called as a white edge, and so on. The manufacturing process of the initial image is as follows: firstly, generating an original image through the operation of image acquisition and production; secondly, whether the size of the original image is the same as a preset size or the size of a device display window can be detected. If the images are the same, directly taking the original image as an initial image; and if the difference is not the same, adjusting the size of the original image to obtain an initial image. Wherein adjusting the size of the original image comprises: adjusting the height and/or width of the original image; referring to fig. 4a, when the height of the original image obtained by image capture is smaller than a preset height or smaller than the height of the device display window, pixel filling is usually performed above and/or below the original image to increase the height of the original image, so that the height of the filled image (i.e., the initial image) is consistent with the preset height or the height of the device display window. Similarly, if the width of the original image is smaller than the preset width or smaller than the width of the device display window, pixel filling may be performed on the left side and/or the right side of the original image to increase the width of the original image, so that the width of the filled image (i.e., the initial image) is consistent with the preset width or the width of the device display window. And filling the area of the pixel points except the original image as the area where the invalid edge is located.
As can be seen from the above description, in the process of making the initial image, there may be an invalid edge (such as a black edge or a white edge) in the initial image due to image adjustment (i.e., pixel filling); in this case, if the image is directly cropped on the initial image, an invalid edge may exist in the cropped image, which may affect the cropping effect. Therefore, after the initial image is acquired, the computer equipment can detect the invalid edge of the initial image, so that when the invalid edge exists in the initial image, the image cutting can be executed after the invalid edge is deleted, and the image cutting effect is improved. In an implementation, the implementation of the invalid edge detection for the initial image by the computer device may include the following steps s11-s 14:
s 11: and determining M detection directions of the initial image, and acquiring a reference value corresponding to the M-th detection direction traversed currently.
Wherein M is a positive integer; the M detection directions may include at least one of: a detection direction from top to bottom, a detection direction from bottom to top, a detection direction from left to right, and a detection direction from right to left. After determining the M detection directions, the computer equipment can respectively detect whether an invalid edge exists in each detection direction of the initial image; it should be noted that, the computer device may simultaneously and concurrently detect the invalid edge in each detection direction, or may sequentially and one by one detect the invalid edge in each detection direction, which is not limited in the embodiment of the present invention.
Wherein m ∈ [1, M]; the reference value corresponding to the m-th detection direction is a metric value used for judging whether an invalid edge exists in the m-th detection direction of the initial image. The reference value corresponding to the m-th detection direction may be obtained by calculating the mean value of all pixel points in the first i pixel groups of the initial image in the m-th detection direction, and the value of i may be set according to an empirical value. One pixel group consists of all pixel points located in the same row or the same column of the initial image; specifically, if the currently traversed detection direction is the bottom-to-top or top-to-bottom detection direction, one pixel group consists of all pixel points located in the same row of the initial image; correspondingly, if the currently traversed detection direction is the left-to-right or right-to-left detection direction, one pixel group consists of all pixel points located in the same column of the initial image. Optionally, the reference value corresponding to the m-th detection direction may also be set in advance according to an empirical value; for example, if a black border in the initial image needs to be detected, the reference value is directly set to [0, 0, 0].
It should be noted that, if the metric values for judging whether the initial image has an invalid edge are the same in all detection directions, that is, the reference values corresponding to different detection directions are the same, then after the computer device obtains the reference value corresponding to the first detection direction, it may use that reference value as the judgment criterion when traversing the other detection directions, without performing the step of obtaining the reference values of the other detection directions again. If the metric values for judging whether the initial image has an invalid edge differ between directions, that is, the reference values corresponding to different detection directions are different, the computer device needs to determine the reference value corresponding to the currently traversed detection direction every time it traverses a detection direction.
s 12: and sequentially scanning each pixel group in the initial image according to the mth detection direction, and counting the number of target pixel points in the currently scanned pixel group according to the reference value.
The target pixel point is a pixel point of which the difference value between the pixel value and the reference value is greater than the difference threshold value; for example, assuming that the difference threshold is 3, the reference value is [0,0,0], the pixel value of the pixel 1 is [0,1,0], the pixel value of the pixel 2 is [3,7,5], and since the difference value between the pixel 2 and the reference value is greater than the difference threshold, the pixel 2 is determined as the target pixel.
The arrangement order of the pixel groups is determined according to the m-th detection direction. If the m-th detection direction is the top-to-bottom detection direction, the pixel group corresponding to the pixel points of the r-th row in the initial image is determined as the r-th pixel group to be scanned, wherein r is a positive integer. Similarly, assuming the initial image includes N rows of pixel points, if the m-th detection direction is the bottom-to-top detection direction, the (r+1)-th scanned pixel group corresponds to the pixel points of the (N-r)-th row in the initial image, where N is a positive integer. When the detection direction is the left-to-right or right-to-left detection direction, the manner of determining the arrangement order of the pixel groups is similar to that for the top-to-bottom or bottom-to-top detection direction, and details are not repeated here.
s 13: if the number of target pixel points in the currently scanned pixel group meets the number condition, continuing to scan and executing an adding process on the invalid count corresponding to the mth detection direction; otherwise, the currently scanned pixel group is determined as a mark pixel group, scanning is terminated in the mth detection direction, and the value of the invalid count at the time of terminating scanning in the mth detection direction is acquired.
The number of the target pixel points satisfies the number condition, which includes: the number of target pixel points in the currently scanned pixel group is smaller than a number threshold, or the ratio of the number of the target pixel points in the currently scanned pixel group to the number of all pixel points in the currently scanned pixel group is smaller than a ratio threshold. The quantity threshold and the ratio threshold can be set according to experience values or service requirements.
s14, if the value of the invalid count is larger than the first threshold, judging that the initial image has an invalid edge in the m-th detection direction; otherwise (i.e. the value of the invalid count is less than or equal to the first threshold), it is determined that no invalid edge exists in the m-th detection direction in the initial image.
Wherein the invalid edge comprises: the initial image is located at the pixel points in each pixel group before the marking pixel group in the mth detection direction, and the marking pixel group is the pixel group which is scanned from the initial image in the mth detection direction and does not meet the quantity condition of the target pixel points in the first group. For example, assume that the mth detection direction is a top-to-bottom detection direction; the first threshold value is 5, if the numerical value of invalid counting when scanning is terminated in the detection direction from top to bottom is 10, the initial image is judged to have an invalid side in the detection direction from top to bottom, and the invalid side comprises the first 10 rows of pixel points of the initial image; if the number of invalid counts at the time of terminating scanning in the top-down detection direction is 3, it is determined that no invalid edge exists in the top-down detection direction in the initial image.
Research shows that when the number of invalid pixel groups in a certain direction of an initial image is large, two situations may exist: one is that the initial image has been padded with filled pixels, and the other is that the initial image is a pure-color image, namely an image composed of pixel points that all have the same pixel value. When the initial image is a pure-color image, it can be considered that no invalid edge exists in the initial image. Based on this, when the number of invalid pixel groups (i.e., the invalid count) corresponding to the m-th detection direction is greater than the first threshold, a second threshold can be used to further determine whether the initial image is a pure-color image or whether an invalid edge exists in the initial image, thereby improving the accuracy of invalid edge detection. Specifically, after determining that the value of the invalid count is greater than the first threshold, it is further determined whether the value of the invalid count is greater than a second threshold (the second threshold is greater than the first threshold), where the second threshold may be calculated from the total number of pixel groups included in the m-th detection direction of the initial image and a preset ratio (e.g., 30%). If the value of the invalid count is greater than or equal to the second threshold, the initial image can be considered a pure-color image, and it can be judged that no invalid edge exists in the m-th detection direction; if the value of the invalid count is smaller than the second threshold, the initial image is considered not to be a pure-color image, and it can be judged that an invalid edge exists in the m-th detection direction of the initial image.
The following describes in detail the scheme of invalid edge detection provided by the present application, taking the detection direction as the top-to-bottom detection direction as an example:
calculating a color average value of pixel points of the first i (if i is 1) rows of the initial image, wherein i is a positive integer, and determining the color average value as a reference value; calculating difference values between pixel values (RGB pixel values) of all pixel points in each line and a reference value in sequence from a first line, and judging whether the number of target pixel points of which the difference values exceed a difference threshold meets a number condition (for example, judging whether the number of the target pixel points in the current line is smaller than a number threshold or judging whether the ratio of the number of the target pixel points in the current line to the number of all pixel points in the current line is smaller than a proportion threshold); if the number of target pixels with the difference values exceeding the difference threshold in the current row meets the number condition (namely the number of the target pixels in the current row is smaller than the number threshold, or the ratio of the number of the target pixels in the current row to the number of all pixels in the currently scanned pixel group is smaller than the proportional threshold), performing one-plus-one processing on the invalid count, and continuously detecting whether the number of the target pixels with the difference values exceeding the difference threshold in the next row meets the number condition or not until the number of the target pixels with the difference values exceeding the difference threshold in the current row is detected not to meet the number condition. Acquiring a numerical value m of the invalid count, and if the numerical value m of the invalid count is less than or equal to a first threshold (namely, invalid edges above the initial image are fewer and can be ignored), or m is greater than or equal to a second threshold (namely, the initial image may be a pure color (such as black) image), determining that no invalid edge exists above the initial image; and if the numerical value m of the invalid count is larger than the first threshold and smaller than the second threshold, judging that an invalid edge exists above the initial image. The invalid edge comprises the first m rows of pixel points of the initial image. And performing invalid edge detection on the initial image from other three directions (from bottom to top, from left to right and from right to left) according to a similar processing mode to obtain an invalid edge detection result of the initial image.
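For illustration, the following Python sketch implements the top-to-bottom scan described above; all threshold values and the per-pixel difference measure (maximum channel deviation) are assumptions rather than values fixed by this application:

```python
import numpy as np

def top_invalid_edge_rows(image, i=1, diff_thresh=3, ratio_thresh=0.01,
                          first_thresh=5, solid_ratio=0.3):
    """Count the filled ("black/white edge") rows at the top of an image.

    image is an H x W x 3 array. The reference value is the mean colour of
    the first i rows; a row counts as "invalid" when the share of pixels
    whose colour differs from the reference by more than diff_thresh stays
    below ratio_thresh. Returns the number of rows to strip (0 when no
    invalid edge is found).
    """
    h, w = image.shape[:2]
    reference = image[:i].reshape(-1, 3).mean(axis=0)
    invalid_count = 0
    for row in image.astype(float):
        diff = np.abs(row - reference).max(axis=1)   # per-pixel deviation from reference
        outliers = int((diff > diff_thresh).sum())
        if outliers / w < ratio_thresh:              # row still looks like filled pixels
            invalid_count += 1
        else:                                        # first row with real content
            break
    second_thresh = int(h * solid_ratio)
    if invalid_count <= first_thresh:    # too few filled rows: ignore them
        return 0
    if invalid_count >= second_thresh:   # probably a solid-colour image, not a border
        return 0
    return invalid_count
```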
S302, if one or more invalid edges exist in the initial image, deleting the one or more invalid edges in the initial image to obtain the target image.
For example, if the initial image has invalid edges in both the top-to-bottom detection direction and the bottom-to-top detection direction, the invalid edges above and below the initial image can be deleted respectively to obtain the target image, as shown in fig. 4 b. Optionally, after the computer device obtains the initial image, the computer device may also directly use the initial image as the target image to be processed without performing invalid edge detection on the initial image.
And S303, performing visual saliency prediction on the target image to obtain saliency information of the target image.
In one embodiment, if the target image is an image in the target video, the target video includes N frames of images, and N is a positive integer, the specific implementation of step S303 may be:
first, a saliency prediction model is obtained, which includes a temporal flow network and a spatial flow network. The saliency prediction model can be obtained by training a convolutional neural network with a training data set (comprising input data and labeled data). The convolutional neural network may be, for example, a Visual Geometry Group (VGG) network, a Residual Network (ResNet), a mobile network (MobileNet), or the like. Specifically, the convolutional neural network is used to perform saliency prediction on the input data in the training data set to obtain prediction data. A loss function measures the difference between the prediction data and the labeled data, and the parameters of the convolutional neural network are adjusted according to this difference to obtain the saliency prediction model. The model loss function is constructed based on at least one of the following three saliency prediction evaluation indexes: the Lucas-Kanade (LK) distance; the Pearson linear correlation coefficient, also called the linear correlation coefficient (CC); and the Normalized Scanpath Saliency (NSS) index.
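As an illustration of how such a loss might be assembled from two of the listed indexes, the sketch below computes CC and NSS on a predicted and a labeled saliency map and negates them so that they can be minimised. The exact loss composition used by the application is not specified, so this is an assumption:

```python
import torch

def cc(pred, gt):
    """Pearson linear correlation coefficient between two saliency maps."""
    p = (pred - pred.mean()) / (pred.std() + 1e-8)
    g = (gt - gt.mean()) / (gt.std() + 1e-8)
    return (p * g).mean()

def nss(pred, fixations):
    """Normalized Scanpath Saliency: mean of the normalised prediction at
    fixation points (`fixations` is a binary gaze map with at least one 1)."""
    p = (pred - pred.mean()) / (pred.std() + 1e-8)
    return p[fixations > 0].mean()

def saliency_loss(pred, gt_map, gt_fixations):
    # higher CC / NSS is better, so negate them to obtain a loss to minimise
    return -cc(pred, gt_map) - nss(pred, gt_fixations)
```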
In one embodiment, the parameters of the feature extraction modules in the temporal flow network and the spatial flow network of the saliency prediction model may be the same; setting these feature extraction modules to share the same parameters can reduce the size of the saliency prediction model. It should be noted that the temporal flow network differs from the spatial flow network in that the dimension of its convolutional layer is different from that of the spatial flow network (the spatial flow network does not include the temporal dimension).
Secondly, one or more frames of associated images of the target image can be acquired, where each frame of associated image and the target image form a continuous image sequence, and the target image can be located at any position in the image sequence (for example, it can be the first image in the sequence, the last image in the sequence, or any image between the first and the last). The time flow network is then called to perform saliency prediction on the target image according to the relevance (such as the time sequence) between each frame of associated image and the target image, so as to obtain a time sequence saliency result of the target image. The time sequence saliency result is used to indicate a first probability value of each pixel point in the target image, and it can be a feature vector of the target image in time sequence or a thermodynamic diagram of the target image in time sequence. For example, assume that a pet A and a pet B appear in the target image and the target image belongs to the target video; if prediction is performed on the target image alone, it is difficult to determine the motion states of pet A and pet B, but based on the features of the target image, the features of each frame of associated image and the relevance between the target image and each frame of associated image, their motion states can be determined. Assuming that a moving object attracts more of the user's attention, and that pet A is moving while pet B is stationary, the region of pet A in the target image is determined as a salient region. Optionally, before calling the time flow network to perform saliency prediction on the target image according to the relevance (such as the time sequence) between each frame of associated image and the target image, the computer device may further adjust the resolutions of the target image and of its one or more frames of associated images according to the current situation (such as the performance of the processing device), for example to 360 × 640 pixels. It can be understood that the higher the resolution, the more features the target image contains, and the lower the resolution, the faster the saliency prediction model runs.
In addition, the spatial stream network can be called to perform saliency prediction on the target image to obtain a spatial saliency result of the target image; specifically, the spatial stream network predicts the spatial saliency result of the target image based on the features of the target image. The spatial saliency result is used to indicate a second probability value of each pixel point in the target image, and it can be a feature vector of the target image in space or a thermodynamic diagram of the target image in space. Finally, the time sequence saliency result and the spatial saliency result can be fused to obtain the saliency information of the target image; specifically, the two results may be fused directly (for example, a convolution network is used to perform convolution processing on the time sequence saliency result and the spatial saliency result, or a mean value operation is performed directly on the first probability value and the second probability value of each pixel point in the target image), so as to obtain the saliency information of the target image. The saliency information of the target image comprises the saliency probability value of each pixel point in the target image, and the saliency probability value of each pixel point is calculated from its first probability value and second probability value.
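A minimal sketch of the direct fusion option mentioned above, taking the mean of the first and second probability values of every pixel point; the convolution-based fusion would replace this averaging with a learned layer:

```python
import numpy as np

def fuse_saliency(temporal_map, spatial_map):
    """Average the per-pixel first (temporal) and second (spatial) probability
    values to obtain the final saliency probability value of each pixel."""
    assert temporal_map.shape == spatial_map.shape
    return (temporal_map + spatial_map) / 2.0
```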
Furthermore, the significance prediction model further comprises a convolution Gaussian layer, the convolution Gaussian layer is obtained based on a plurality of Gaussian kernels with different variances through training, the convolution Gaussian layer is used for correcting significance information of a target area in the target image, and the target area is determined according to the Gaussian kernels with different variances. Accordingly, the specific implementation of fusing the time-series saliency result and the spatial saliency result to obtain the saliency information of the target image may include any one of the following:
First implementation: the time sequence saliency result and the spatial saliency result are fused (for example, a convolution network is used to perform convolution processing on them) to obtain a fusion result; the fusion result is used to indicate the fusion probability value of each pixel point in the target image, and the fusion probability value of each pixel point is calculated from its first probability value and second probability value. The convolution Gaussian layer is then called to calibrate the fusion result (for example, the fusion result is convolved again by the convolution Gaussian layer) to obtain the saliency information of the target image.
Second implementation: the spatial saliency result and the time sequence saliency result of the target image are first calibrated separately by the convolution Gaussian layer; the calibrated time sequence saliency result is used to indicate a calibrated first probability value of each pixel point in the target image, and the calibrated spatial saliency result is used to indicate a calibrated second probability value of each pixel point in the target image. The calibrated spatial saliency result and the calibrated time sequence saliency result are then fused (for example, by convolution processing with a convolution network) to obtain the saliency information of the target image.
Third implementation: in the process of fusing the time sequence saliency result and the spatial saliency result of the target image, the fusion features produced during fusion are calibrated by the convolution Gaussian layer to obtain the saliency information of the target image.
It should be noted that convolution Gaussian layers with different functions can be obtained by training them with different types of Gaussian kernels. For example, a convolution Gaussian layer 1 obtained by training with a first type of Gaussian kernel is used to enhance the saliency of the central region of the target image and suppress the saliency of the edge regions of the target image; a convolution Gaussian layer 2 obtained by training with a second type of Gaussian kernel is used to enhance the saliency of the left region of the target image and suppress the saliency of the right region of the target image.
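The effect attributed to convolution Gaussian layer 1 (boosting the centre, suppressing the borders) can be approximated with a fixed centre-prior map, as in the sketch below; the learned layer in the application is trained rather than fixed, so this is only an analogy, and the sigma ratio is an assumed value:

```python
import numpy as np

def center_prior(h, w, sigma_ratio=0.25):
    """Illustrative centre prior: a 2-D Gaussian peaked at the image centre,
    used to boost central saliency and damp the borders."""
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    sy, sx = h * sigma_ratio, w * sigma_ratio
    return np.exp(-(((ys - cy) / sy) ** 2 + ((xs - cx) / sx) ** 2) / 2.0)

# usage: calibrated = fused_saliency * center_prior(*fused_saliency.shape)
```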
Based on the above description, fig. 4c exemplarily shows a model architecture of the saliency prediction model. As shown in fig. 4c, the saliency prediction model mainly includes a temporal flow network, a spatial flow network and a fusion module. The temporal flow network comprises a feature extraction module and a 3D convolutional layer; the feature extraction module extracts the features of each frame of image in a continuous image sequence (comprising at least 2 frames of images), and the 3D convolutional layer predicts the time sequence saliency result of the currently processed image in the sequence based on the features of each frame of image and their association relations (such as the time sequence relation). The spatial flow network comprises a feature extraction module and a 2D convolutional layer; the feature extraction module extracts the features of the currently processed image, and the 2D convolutional layer predicts the spatial saliency result of the current image based on these features. The fusion module fuses the time sequence saliency result output by the temporal flow network and the spatial saliency result output by the spatial flow network to obtain the saliency information of the currently processed image; the fusion module comprises a Gaussian convolution layer, which is obtained by training a 2D convolution layer based on a plurality of Gaussian kernels with different variance sizes.
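The following sketch mirrors the architecture of fig. 4c in a simplified form: a shared 2-D feature extractor, a 3-D convolution head for the temporal stream, a 2-D convolution head for the spatial stream, and a convolutional fusion layer standing in for the Gaussian convolution layer. All layer sizes, the single-layer backbone and the class name are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TwoStreamSaliency(nn.Module):
    """Rough sketch of the fig. 4c architecture; layer sizes are illustrative."""
    def __init__(self, feat_ch=64):
        super().__init__()
        # shared feature extractor (a real model would use e.g. a VGG/MobileNet backbone)
        self.backbone = nn.Sequential(nn.Conv2d(3, feat_ch, 3, padding=1), nn.ReLU())
        self.temporal_head = nn.Conv3d(feat_ch, 1, kernel_size=(3, 3, 3), padding=1)  # 3D conv over the frame sequence
        self.spatial_head = nn.Conv2d(feat_ch, 1, kernel_size=3, padding=1)           # 2D conv on the current frame
        self.fusion = nn.Conv2d(2, 1, kernel_size=5, padding=2)                       # stands in for the Gaussian conv layer

    def forward(self, frames):                     # frames: (B, T, 3, H, W); last frame is the current image
        b, t, c, h, w = frames.shape
        feats = self.backbone(frames.reshape(b * t, c, h, w)).reshape(b, t, -1, h, w)
        temporal = self.temporal_head(feats.permute(0, 2, 1, 3, 4)).mean(dim=2)       # (B, 1, H, W)
        spatial = self.spatial_head(feats[:, -1])                                     # (B, 1, H, W)
        fused = self.fusion(torch.cat([temporal, spatial], dim=1))
        return torch.sigmoid(fused)                                                   # per-pixel saliency probability
```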
In another embodiment, if the target image is a single image, the specific implementation of step S303 may be: and acquiring an image prediction model, and calling the image prediction model to perform visual saliency prediction processing on the target image to obtain saliency information of the target image. The image prediction model is constructed based on a spatial flow network; specifically, the spatial stream network predicts a spatial saliency result of the target image based on the features of the target image and determines the spatial saliency result as saliency information of the target image.
Similarly, the image prediction model further comprises a convolution Gaussian layer, and after the spatial saliency result of the target image is obtained through the spatial flow network based on the feature prediction of the target image, the spatial saliency result of the target image is calibrated through the convolution Gaussian layer to obtain the saliency information of the target image; for example, the spatial saliency result of the target image indicates that the saliency region of the target image is located in the upper right region of the target image, and after the spatial saliency result of the target image is subjected to calibration processing through a convolution Gaussian layer, the obtained saliency information of the target image indicates that the saliency region of the target image is located in the central region of the target image.
And S304, determining the sliding direction of the cutting frame in the target image.
The sliding direction of the crop box in the target image is determined according to the width-height ratio of the target image and the width-height ratio of the crop box. In one embodiment, the width-height ratio of the target image is greater than the width-height ratio of the crop box. The computer device determines the height of the crop box according to the height of the target image, so that the height of the image in the crop box is consistent with the height of the target image, and determines the sliding direction of the crop box as the horizontal sliding direction (i.e. sliding left and right).
In another embodiment, the aspect ratio of the target image is less than the aspect ratio of the crop box. The computer device determines the width of the crop box according to the width of the target image, so that the width of the image in the crop box is consistent with the width of the target image, and determines the sliding direction of the crop box as a vertical sliding direction (i.e. sliding up and down).
It can be understood that if the width-height ratio of the target image is consistent with the width-height ratio of the crop box, the target image may simply be scaled so that its size matches the size of the current display device. It should be noted that when the computer device is a terminal, the current display device and the computer device may be the same device or different devices, and this application is not limited in this respect.
Optionally, if the size of the target image and the size of the cropping frame are both fixed sizes, and the width of the cropping frame is smaller than the width of the target image, the height of the cropping frame is smaller than the height of the target image; the crop box can slide in the target image in the horizontal direction, the vertical direction, or the direction specified by the user or the preset direction.
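A small helper that compares the two width-height ratios and returns the crop-box size together with the sliding direction can summarise step S304; the concrete numbers in the usage comment are only an example, not values from the application:

```python
def decide_slide(img_w, img_h, crop_ratio_w, crop_ratio_h):
    """Pick the crop-box size and sliding direction from the two width-height ratios."""
    img_ratio = img_w / img_h
    crop_ratio = crop_ratio_w / crop_ratio_h
    if img_ratio > crop_ratio:                     # image is "wider" than the crop box
        crop_h = img_h
        crop_w = round(crop_h * crop_ratio)
        return (crop_w, crop_h), "horizontal"      # slide left and right
    if img_ratio < crop_ratio:                     # image is "taller" than the crop box
        crop_w = img_w
        crop_h = round(crop_w / crop_ratio)
        return (crop_w, crop_h), "vertical"        # slide up and down
    return (img_w, img_h), None                    # same ratio: scale only, no sliding needed

# decide_slide(1920, 1080, 9, 16) -> ((608, 1080), 'horizontal')
```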
And S305, sliding the cutting frame in the target image according to the sliding direction to determine a plurality of candidate positions of the cutting frame in the target image.
Each candidate position is determined according to the position of the crop box in the target image after each sliding. The sliding distance between the position of the crop box after each sliding and its previous position can be a preset distance (for example, a preset number of pixel points). It can be understood that the smaller the distance per slide, the more candidate positions of the crop box are contained in a target image of the same size, so the sliding distance may also be determined according to relevant factors (such as the size of the target image, the performance of the server, and the like).
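Under the assumption that the crop box slides from one image border to the other with a fixed step, the candidate positions can be enumerated as follows (the step of 10 pixels is an illustrative preset distance):

```python
def candidate_positions(img_w, img_h, crop_w, crop_h, direction, step=10):
    """Enumerate candidate positions of the crop box as (x, y) top-left corners,
    sliding with a preset step along the chosen direction."""
    if direction == "horizontal":
        return [(x, 0) for x in range(0, img_w - crop_w + 1, step)]
    return [(0, y) for y in range(0, img_h - crop_h + 1, step)]
```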
S306, calculating the attribute of the saliency region included by the cutting frame at each candidate position according to the saliency information.
The saliency information includes the saliency probability value of each pixel point in the target image. The attribute of the saliency region included by the crop box at any candidate position includes: the saliency calculated according to the saliency probability values of the pixel points included by the crop box at that candidate position.
In one embodiment, the saliency probability values of the pixel points of the target image are projected according to the sliding direction to obtain a projection curve. Specifically, suppose the target image includes P rows × Q columns of pixel points, where P and Q are positive integers. If the sliding direction is the horizontal sliding direction, the sum of the saliency probability values of all pixel points in the q-th column of the target image is obtained in turn as the projection point of the q-th column, with q ∈ [1, Q]; a curve is then drawn through the projection points of all columns of the target image to obtain the projection curve. If the sliding direction is the vertical sliding direction, the sum of the saliency probability values of all pixel points in the p-th row of the target image is obtained in turn as the projection point of the p-th row, with p ∈ [1, P]; a curve is then drawn through the projection points of all rows of the target image to obtain the projection curve.
After the projection curve is obtained, determining a curve segment (namely a part of the projection curve) included by the cutting frame at each candidate position according to the projection curve, and calculating the significance of a significance region included by the cutting frame at the u-th candidate position according to the curve segment included by the cutting frame at the u-th candidate position, wherein the u-th candidate position is any candidate position of the cutting frame, and u is a positive integer. Specifically, the curve segment included in the crop box at the u-th candidate position is integrated to obtain the saliency of the saliency region included in the crop box at the u-th candidate position.
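A compact sketch of this projection-and-integration procedure: the saliency map is summed along the sliding direction to form the projection curve, and a cumulative sum makes the per-candidate integration cheap. The step size is again an assumed preset distance:

```python
import numpy as np

def projection_curve(saliency, direction):
    """Column sums for horizontal sliding, row sums for vertical sliding."""
    return saliency.sum(axis=0) if direction == "horizontal" else saliency.sum(axis=1)

def best_offset(curve, crop_len, step=10):
    """Integrate (sum) the curve segment covered by the crop box at every candidate
    offset and return the offset with the largest saliency."""
    cumulative = np.concatenate([[0.0], np.cumsum(curve)])
    offsets = range(0, len(curve) - crop_len + 1, step)
    scores = [cumulative[o + crop_len] - cumulative[o] for o in offsets]
    return list(offsets)[int(np.argmax(scores))]
```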
Fig. 4d is a schematic diagram of calculating the saliency of a saliency region according to the saliency information, provided by an embodiment of the present application. As shown in fig. 4d, the sliding direction of the crop box in the target image is the horizontal sliding direction, and a coordinate system is established with the lower right corner of the target image as the origin; any point (x, y) on the projection curve indicates that the sum of the saliency probability values of the pixel points in the x-th column of the target image is y. The saliency of the saliency region at the current position of the crop box is the result of integrating the curve segment inside the crop box (i.e. the area S1).
Optionally, the attribute of the saliency area included in the crop box at each candidate position may further include: the area of the saliency region, or the sum of the saliency probability values of the pixels comprised by the crop box at each candidate position.
And S307, selecting a candidate position meeting the attribute condition from the plurality of candidate positions as a target position of the cutting frame in the target image.
In one embodiment, the attribute of the saliency region included by the crop box at any candidate position includes: the saliency calculated according to the saliency probability values of the pixel points included by the crop box at that candidate position; and the attribute condition includes: the condition that the saliency is greater than a saliency threshold, or the condition that the saliency is the maximum. Accordingly, the specific implementation of step S307 may be: the plurality of candidate positions are traversed, and if the saliency of the saliency region included by the crop box at the current candidate position is greater than the saliency threshold, or is greater than the saliency of the saliency regions included by the crop box at the other candidate positions, the current candidate position is determined as the target position of the crop box in the target image.
In another embodiment, the attribute of the saliency region included by the crop box at any candidate position includes: the area of the saliency region included by the crop box at that candidate position; and the attribute condition includes: the condition that the area is the maximum, or the condition that the area is greater than an area threshold. Accordingly, the specific implementation of step S307 may be: the plurality of candidate positions are traversed, and if the area of the saliency region included by the crop box at the current candidate position is greater than the area threshold, or is greater than the area of the saliency regions included by the crop box at the other candidate positions, the current candidate position is determined as the target position of the crop box in the target image.
In another embodiment, the attribute of the saliency region included by the crop box at any candidate position includes: the sum of the saliency probability values of the pixel points included by the crop box at that candidate position. Accordingly, the specific implementation of step S307 may be: the plurality of candidate positions are traversed, and if the sum of the saliency probability values of the pixel points included by the crop box at the current candidate position is greater than a sum threshold, or is greater than the corresponding sum at the other candidate positions, the current candidate position is determined as the target position of the crop box in the target image.
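The three attribute conditions lead to very similar selection code; the sketch below uses the third attribute (the sum of the saliency probability values inside the crop box) and supports both the threshold form and the maximum form of the condition. The function name and default behaviour are illustrative:

```python
import numpy as np

def select_target_position(saliency, positions, crop_w, crop_h, threshold=None):
    """Score each candidate by the sum of saliency probability values inside the
    crop box; return the first candidate above `threshold`, or the candidate with
    the largest score when no threshold is given."""
    scores = [saliency[y:y + crop_h, x:x + crop_w].sum() for (x, y) in positions]
    if threshold is not None:
        for pos, s in zip(positions, scores):
            if s > threshold:
                return pos
    return positions[int(np.argmax(scores))]
```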
The following describes steps S304-S307 in detail by way of a complete example:
assuming that the width-height ratio of the target image is 16:9 (horizontal screen) and the width-height ratio of the crop box is 9:16 (vertical screen), the height of the crop box is determined according to the height of the target image, so that the height of the image in the crop box is consistent with the height of the target image, the width of the crop box is calculated according to the height of the crop box and the width-height ratio, and the moving direction of the crop box is determined as the horizontal direction (i.e. the crop box is moved in the horizontal direction to find the optimal crop position, namely the target position). And after the moving direction of the cutting frame is determined, horizontally projecting the significance prediction result on the target image according to the moving direction of the cutting frame to obtain a projection curve, wherein the value of any point on the projection curve is obtained by accumulating the significance probability values of all pixel points on the pixel column to which the point belongs. And sliding the cutting frame in the horizontal direction, and calculating the significance corresponding to the current position of the cutting frame according to the curve segment in the current cutting frame until the cutting frame traverses the whole target image to obtain the significance corresponding to the cutting frame at different (candidate) positions. And determining the candidate position with the highest corresponding significance degree in all the candidate positions as the cutting position (namely the target position).
If the target image is a single image, after the target position of the cropping frame in the target image is determined, the target image can be cropped by directly using the cropping frame at the target position, so as to obtain a cropped image. If the target image is any frame image in the target video clip, continuing to execute step S308; the target video segment herein may be the aforementioned target video, or may be any sub-video segment obtained by splitting the target video when the number of image frames included in the aforementioned target video exceeds a frame number threshold.
S308, determining the target position coordinates of the target position, and determining the reference position coordinates of the reference position of the cutting frame in each reference image.
The target image belongs to the target video clip; the reference images are the images in the target video clip other than the target image; the reference position is the position of the crop box in a reference image when the crop box crops that reference image, and the reference position coordinates are the coordinates of that position.
In one embodiment, the cropping position of each frame of image of the target video is determined frame by frame according to steps S304 to S307, a coordinate system is established based on the target image (for example, with the upper left corner of the target image as the origin), and the corresponding coordinates are determined according to the position of the crop box (for example, the coordinates of the upper left corner and of the lower right corner of the crop box are used to represent the position of the crop box in the target image); the moving direction of the crop box is then determined from the difference between the positions of the crop box in adjacent frames. Table 1 is a coordinate table of the crop box for each frame of image in a target video provided by an embodiment of the present application:
TABLE 1

| Image frame number | X1   | Y1 | X2   | Y2   | Integrity score |
| ------------------ | ---- | -- | ---- | ---- | --------------- |
| 1                  | 945  | 0  | 1945 | 2134 | 0.8517          |
| 2                  | 945  | 0  | 1945 | 2134 | 0.8802          |
| …                  | …    | 0  | …    | 2134 | …               |
| n                  | 1437 | 0  | 2437 | 2134 | 0.8517          |
Wherein, X1 and Y1 are used for indicating the coordinates of the upper left corner of the cutting box; x2, Y2 are used to indicate the coordinates of the bottom right corner of the crop box. The image frame serial numbers sequentially correspond to n frame images in the target video, namely the a-th frame image in the target video corresponds to the image serial number a, wherein a and n are positive integers, and a is less than or equal to n. The image frame number also has a retrieval function, and the coordinates of the cutting frame in the image frame and the integrity score of the image contained in the cutting frame can be uniquely determined through the image frame number. The integrity score is used to indicate the information integrity of the cropped image relative to the image before cropping (i.e. used to evaluate whether the cropped image can completely express the main content of the target image), and the higher the integrity score is, the higher the information integrity of the cropped image is. Optionally, the coordinate table of the cropping frame may further include the saliency of the salient region included in the cropping frame in each image frame (i.e., the saliency of the cropped image).
S309, calibrating the target position coordinates according to the reference position coordinates to obtain calibrated position coordinates.
As can be seen from the foregoing, the target image is any frame image in the target video segment; with the number of images in the target video segment being different, different calibration methods can be used to calibrate the target position coordinates, specifically referring to the following descriptions:
if the number of the images in the target video clip is smaller than the number threshold, performing mean value operation on each reference position coordinate and the target position coordinate to obtain a calibrated position coordinate; for example, assuming a number threshold of 5, a number of images in the target video segment of 3, the target location coordinates include: coordinates of the upper left corner of the cutting box (973, 0) and coordinates of the lower right corner of the cutting box (1973, 2153); the reference position coordinates 1 include: coordinates of the upper left corner of the cutting box (971, 0) and coordinates of the lower right corner of the cutting box (1971, 2153); the reference position coordinates 2 include: coordinates of the upper left corner of the crop box (978, 0) and coordinates of the lower right corner of the crop box (1978, 2153); performing mean value operation on the target position coordinates, the reference position coordinates 1 and the reference position coordinates 2 to obtain calibrated position coordinates as follows: coordinates of the top left corner of the crop box (974, 0), and coordinates of the bottom right corner of the crop box (1974, 2153).
And if the number of the images in the target video clip is larger than the number threshold, performing one-dimensional Gaussian smoothing processing on each reference position coordinate and the target position coordinate, and determining the calibrated position coordinate according to the smoothing processing result. In one embodiment, whether the target video segment has a target image group or not is detected according to the smoothing result, the target image group comprises a target image, and the coordinate difference value between the smooth coordinate of the cutting frame in any image in the target image group and the smooth coordinate of the cutting frame in the first frame image in the target image group is smaller than or equal to a difference threshold value. If the target video clip has the target image group, performing mean value operation on the smooth position coordinates of the cropping frame in each image in the target image group to obtain calibrated position coordinates; and if the target image group does not exist in the target video clip, taking the smoothed coordinates of the target position coordinates after smoothing as the calibrated position coordinates.
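Assuming the crop-box coordinates of a clip are collected as one (X1, Y1, X2, Y2) row per frame, the two calibration branches can be sketched as follows; the number threshold and the Gaussian sigma are illustrative values, and scipy's gaussian_filter1d stands in for the one-dimensional Gaussian smoothing:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def calibrate_positions(coords, number_threshold=5, sigma=2.0):
    """`coords` is an (n_frames, 4) array of crop-box coordinates (X1, Y1, X2, Y2)
    for one video clip; thresholds are illustrative."""
    coords = np.asarray(coords, dtype=np.float64)
    if len(coords) < number_threshold:
        # short clip: every frame gets the mean of all crop-box coordinates
        return np.tile(coords.mean(axis=0), (len(coords), 1))
    # longer clip: 1-D Gaussian smoothing of each coordinate over time
    return gaussian_filter1d(coords, sigma=sigma, axis=0)
```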
And S310, at the position indicated by the calibrated position coordinate, cutting the target image by using a cutting frame to obtain a cut image.
The specific implementation of step S310 can refer to the implementation of step S204 in fig. 2, and is not described herein again. Fig. 4e is a schematic diagram of a plurality of results that can be output by a computer device according to an embodiment of the present application. As shown in fig. 4e, in addition to outputting the cropped target image, the computer device may further output the target image indicating the region of significance (i.e., a thermodynamic diagram of the target image), the target image when the cropping frame is located at the target position, the coordinates of the cropped image shown in table 1, the integrity score of the cropped image, and the like, and one or more of the output results may further detail the cropping process of the target video to the user, and may further provide a reference for the developer to optimize the significance prediction model.
In one embodiment, the computer device calculates the saliency of the target image according to the saliency probability value of each pixel point in the target image; and then, according to the saliency of the target image and the saliency of the cut image, performing integrity scoring on the cut image (for example, calculating the ratio of the saliency of the cut image to the saliency of the target image to obtain the integrity scoring of the cut image), and outputting a scoring result. Wherein the saliency of the clipped image is equal to the saliency of the saliency area included when the clip frame is at the target position.
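One plausible reading of this scoring (the application states only that the score is computed from the two saliency values, and gives the ratio as an example) is the ratio of the cropped image's saliency to the target image's saliency:

```python
def integrity_score(image_saliency_sum, cropped_saliency_sum):
    """Share of the target image's total saliency preserved inside the crop box."""
    return cropped_saliency_sum / max(image_saliency_sum, 1e-8)
```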
On the basis of the embodiment of fig. 2, the embodiment of the present application detects the invalid edges of the initial image and deletes them from the initial image frame, so that the cropped image can include as much of the model-predicted saliency region as possible; the cropped image can therefore better attract the user's attention and improve user stickiness. By smoothing the coordinates of the crop box at the target position, the problem of image shake in the cropped video (the view moving back and forth within a small range) can be alleviated, further improving the user experience. In addition, the scheme provided by the embodiment of the present application can adaptively crop different images and score the content integrity of the cropping result; it can be applied at scale to video editing and assisted editing, can save a large amount of time and labor cost, and has high practical value.
The image cropping method provided in the embodiment of the present application and shown in fig. 2 or fig. 3 may be packaged in software or a plug-in through a Software Development Kit (SDK), or may be loaded in a network server, and provides a network service interface for a user to use. The user uploads or reads a piece of video (the address of the video can also be specified), specifies the aspect ratio of the cropped video, and then selects the type of output result (such as thermodynamic diagram, coordinates of the cropping frame at the time of cropping, completeness score of the cropped image, and the like). After the service interface is triggered, the target video uploaded or read by the user is cut according to the specified parameters (namely the width-height ratio of the cut video and the like), and an output result of the specified type is returned. The specific process of cutting the target video uploaded or read by the user according to the specified parameters (i.e. the width-to-height ratio of the cut video) is as follows:
as shown in fig. 4f and 4g, an original image 401 is obtained from a target video, and with the original image 401 and a given cropping aspect ratio as input, an invalid edge processing is first performed on the original image 401 (i.e., the original image 401 is subjected to invalid edge detection from four directions, i.e., top, bottom, left and right, and if an invalid edge exists in the original image 401, the invalid edge in the original image 401 is deleted), so as to obtain a target image 402; if the original image 401 has no invalid edge, the original image 401 is directly determined as the target image 402. Then, the target image 402 is subjected to visual saliency analysis processing (for example, a saliency prediction model is called to perform saliency prediction on the target image 402), so that a prediction result image 403 (namely, saliency information of the target image) is obtained, and the prediction result image 403 (namely, a thermodynamic diagram of the target image 402) includes a saliency region 4031. Then, determining a cropping position 404 (i.e. a target position) of the cropping frame in the target image 402 according to the width-to-height ratio of the cropping frame and the prediction result image 403 (the distribution of the salient region in the image), and then performing smoothing processing on the position coordinate of the cropping frame in the target image according to the position coordinate of the cropping frame in the reference image in the target video, specifically:
(1) Assuming that the number of images included in the target video is N, it is judged whether N exceeds a threshold S1 (e.g. S1 = 1024). If N > S1, step (2) is executed; if N < S2 (where S2 < S1), step (3) is executed; if S2 ≤ N ≤ S1, step (4) is executed.
(2) The target video is judged to be a long video and is split into a plurality of video clips whose length does not exceed S1 (these clips include the target video clip, i.e. the clip to which the target image belongs), and each video clip is processed according to step (4) and step (5).
(3) The target video is judged to be a short video; a mean value operation is performed on the target position coordinates of the target image and the reference position coordinates of the reference images in the target video to obtain the calibrated position coordinates, the target image and the reference images are cropped at the position indicated by the calibrated position coordinates to obtain the cropped target video, and the processing flow ends.
(4) One-dimensional Gaussian smoothing is performed on the target position coordinates of the target image and the reference position coordinates of the reference images in the target video to obtain an initial smoothing result, and step (5) is executed.
(5) The movement of the crop box in the f_n2-th frame of the target video relative to the crop box in the f_n1-th frame is calculated (i.e. the difference between the target position coordinates of the f_n2-th frame and those of the f_n1-th frame); if the movement exceeds a threshold t_s3 (e.g. 30 pixel points), the motion is considered severe, and step (6) is executed. If the movement is smaller than the threshold t_s3, the motion is considered relatively stable, and the movements of the crop box in the frames after the f_n2-th frame relative to the crop box in the f_n1-th frame are calculated in turn, until the movement of the crop box in the f_nx-th frame relative to the crop box in the f_n1-th frame exceeds the threshold t_s3; a mean value operation is then performed on the position coordinates of the f_n1-th to f_nx-1-th frames after the one-dimensional Gaussian smoothing, giving the calibrated position coordinates of the f_n1-th to f_nx-1-th frames of the target video, and step (6) is executed.
(6) f_n1 is updated to f_nx; if the target video still contains video frames not yet processed by step (5), step (5) continues to be executed; if no such frames remain, the target image and the reference images are cropped at the positions indicated by the calibrated position coordinates to obtain the cropped target video, and the current processing flow ends.
Smoothing the position coordinates of the cutting frame in the target image through the steps (1) to (6), so that the coordinates of the target position can be smoother in time sequence, and cutting the target image 402 according to the smoothing result to obtain a cut image 405; and cutting other images except the target image included in the target video according to the same mode to obtain a cut video. It should be noted that the smoothing process of the position coordinates of the crop box in the target image is performed after the position coordinates of the crop box in the target image are determined and the position coordinates of the crop box in each reference image are determined.
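Steps (1) to (6) above can be condensed into the following sketch, which splits long videos, averages short clips, and otherwise smooths the per-frame coordinates and then averages them within stable runs. S1 = 1024 and t_s3 = 30 are the example values given in the text; S2 and the Gaussian sigma are assumptions:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def smooth_video_crop_positions(coords, s1=1024, s2=5, t_s3=30, sigma=2.0):
    """`coords` holds one (X1, Y1, X2, Y2) row per frame of the target video."""
    coords = np.asarray(coords, dtype=np.float64)
    n = len(coords)
    if n > s1:                                       # step (2): split a long video into clips of length <= S1
        return np.vstack([smooth_video_crop_positions(coords[i:i + s1], s1, s2, t_s3, sigma)
                          for i in range(0, n, s1)])
    if n < s2:                                       # step (3): a short clip keeps one averaged position
        return np.tile(coords.mean(axis=0), (n, 1))
    smoothed = gaussian_filter1d(coords, sigma=sigma, axis=0)   # step (4)
    out, start = smoothed.copy(), 0
    for f in range(1, n + 1):                        # steps (5)-(6): average within stable runs
        if f == n or np.abs(smoothed[f] - smoothed[start]).max() > t_s3:
            out[start:f] = smoothed[start:f].mean(axis=0)
            start = f
    return out
```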
Further, the content integrity score of the cropped video can be calculated and output according to the user's requirements. In this way, a user can crop batches of videos or images (e.g. converting landscape to portrait) through the service interface, and judge from the output result (such as the integrity score) whether the cropped images or videos need manual adjustment.
While the method of the embodiments of the present application has been described in detail above, to facilitate better implementation of the above-described aspects of the embodiments of the present application, the apparatus of the embodiments of the present application is provided below accordingly.
Referring to fig. 5, fig. 5 is a schematic structural diagram of an image cropping device provided in an embodiment of the present application, where the image cropping device may be mounted on a computer device in the foregoing method embodiment, and the computer device may specifically be a terminal or a server with an image processing capability. The image cropping means shown in fig. 5 may be used to perform some or all of the functions in the method embodiments described above with respect to fig. 2 and 3. Wherein, the detailed description of each unit is as follows:
an obtaining unit 501, configured to obtain a target image to be processed, and determine a cropping frame for cropping the target image;
a processing unit 502, configured to perform visual saliency prediction on the target image to obtain saliency information of the target image, where the saliency information is used to indicate a distribution of a saliency region in the target image, and the saliency region is a region in the target image that can attract the attention of a user;
the processing unit 502 is further configured to determine, according to the saliency information, a target position of the crop box in the target image, where the target position is: when the attribute of the saliency region included in the crop box meets the attribute condition, the position of the crop box in the target image;
the processing unit 502 is further configured to perform clipping processing on the target image at the target position by using the clipping frame, so as to obtain a clipped image.
In an embodiment, the processing unit 502 is configured to determine, according to the saliency information, a target position of the crop box in the target image, and specifically, to:
determining a sliding direction of the cutting frame in the target image;
sliding the cutting frame in the target image according to the sliding direction to determine a plurality of candidate positions of the cutting frame in the target image, wherein each candidate position refers to a position of the cutting frame in the target image after each sliding;
calculating the attribute of a saliency region included by the cutting frame at each candidate position according to the saliency information;
and selecting a candidate position corresponding to the target attribute meeting the attribute condition from the plurality of candidate positions as the target position of the cutting frame in the target image.
In an embodiment, the processing unit 502 is configured to determine a sliding direction of the crop box in the target image, and specifically is configured to:
acquiring the width-height ratio of the target image and the width-height ratio of the cutting frame;
if the width-height ratio of the target image is larger than the width-height ratio of the cutting frame, determining the sliding direction of the cutting frame in the target image as a horizontal sliding direction;
and if the width-height ratio of the target image is smaller than the width-height ratio of the cutting frame, determining the sliding direction of the cutting frame in the target image as a vertical sliding direction.
In one embodiment, the saliency information includes the saliency probability value of each pixel point in the target image; the attribute of the saliency region included by the crop box at any candidate position includes: the saliency calculated according to the saliency probability values of the pixel points included by the crop box at that candidate position;
the attribute conditions include: a condition of significance greater than a significance threshold, or a condition of maximum significance.
In an embodiment, the processing unit 502 is configured to calculate an attribute of a salient region included in the crop box at each candidate position, and specifically, to:
according to the sliding direction, projecting the significance probability value of each pixel point of the target image into the target image to obtain a projection curve;
for any candidate position, determining a curve segment included by the cutting frame at any candidate position from the projection curve; and integrating the curve segment to obtain the significance of the significance region included by the cutting frame at any candidate position.
In one embodiment, the target image includes P rows × Q columns of pixel points, and values of P and Q are positive integers; the sliding direction comprises a horizontal sliding direction or a vertical sliding direction; the processing unit 502 is configured to project, according to the sliding direction, the saliency probability values corresponding to the pixel points of the target image into the target image to obtain a projection curve, and specifically configured to:
if the sliding direction is the horizontal sliding direction, sequentially obtaining the sum of the saliency probability values of all pixel points in the q-th column of the target image as the projection point of the q-th column, wherein q ∈ [1, Q]; and drawing a curve through the projection points of all columns of the target image to obtain the projection curve;
if the sliding direction is the vertical sliding direction, sequentially obtaining the sum of significance probability values of all pixel points in a P-th row in the target image, and using the sum as a projection point of the P-th row, wherein P belongs to [1, P ]; and adopting projection points of each row in the target image to perform curve drawing to obtain a projection curve.
In an embodiment, after obtaining the cropped image, the processing unit 502 is further configured to:
calculating the significance of the target image according to the significance probability value of each pixel point in the target image;
according to the significance of the target image and the significance of the cut image, carrying out integrity scoring on the cut image and outputting a scoring result;
wherein the saliency of the cropped image is equal to the saliency of the saliency area included by the crop box at the target position.
In one embodiment, the target image is any frame image in a target video segment; the processing unit 502 is configured to perform clipping processing on the target image at the target position by using the clipping frame, so as to obtain a clipped image, and specifically configured to:
determining target position coordinates of the target position and reference position coordinates of a reference position of the cutting frame in each reference image; the reference image refers to an image in the target video clip except the target image;
calibrating the target position coordinates according to the reference position coordinates to obtain calibrated position coordinates;
and at the position indicated by the calibrated position coordinate, cutting the target image by adopting the cutting frame to obtain a cut image.
In an embodiment, the processing unit 502 is configured to perform calibration processing on the target position coordinates according to each reference position coordinate to obtain calibrated position coordinates, and specifically configured to:
if the number of the images in the target video clip is smaller than a number threshold, performing mean value operation on each reference position coordinate and the target position coordinate to obtain a calibrated position coordinate;
and if the number of the images in the target video clip is larger than the number threshold, performing one-dimensional Gaussian smoothing on each reference position coordinate and the target position coordinate, and determining the calibrated position coordinate according to the smoothing result.
In one embodiment, the smoothing result comprises: the smoothed coordinates of the reference positions and the smoothed coordinates of the target position coordinates are obtained; the processing unit 502 is configured to determine the calibrated position coordinate according to the smoothing processing result, and specifically configured to:
detecting whether a target image group exists in the target video clip according to the smoothing processing result, wherein the target image group comprises the target image, and the coordinate difference value between the smooth coordinate of the cutting frame in any image in the target image group and the smooth coordinate of the cutting frame in the first frame image in the target image group is smaller than or equal to a difference threshold value;
if so, performing mean value operation on the smooth position coordinates of the cutting frame in each image in the target image group to obtain calibrated position coordinates;
and if the target position coordinate does not exist, taking the smoothed coordinate obtained by smoothing the target position coordinate as the calibrated position coordinate.
In an embodiment, the processing unit 502 is configured to acquire a target image to be processed, and specifically configured to:
acquiring an initial image;
detecting an invalid edge of the initial image, wherein the invalid edge is an edge formed by filling pixels in the image;
if one or more invalid edges exist in the initial image, deleting the one or more invalid edges in the initial image to obtain a target image;
and if no invalid edge exists in the initial image, determining the initial image as the target image to be processed.
In an embodiment, the processing unit 502 is configured to perform invalid edge detection on the initial image, and specifically configured to:
determining M detection directions of the initial image, and acquiring a reference value corresponding to the mth detection direction; wherein M is a positive integer, and M belongs to [1, M ];
sequentially scanning each pixel group in the initial image according to the mth detection direction, wherein one pixel group consists of all pixel points which are positioned in the same row or the same column in the initial image;
counting the number of target pixel points in the currently scanned pixel group according to the reference value, wherein the target pixel points refer to: pixel points of which the difference value between the pixel value and the reference value is greater than a difference threshold value;
if the number of the target pixel points meets the number condition, continuing to scan and adding one to the invalid count corresponding to the mth detection direction; otherwise, determining the pixel group scanned currently as a marked pixel group, terminating scanning in the mth detection direction, and acquiring the numerical value of the invalid count when the scanning is terminated in the mth detection direction;
if the value is greater than a first threshold, determining that an invalid edge exists in the mth detection direction of the initial image, and the invalid edge includes: each pixel group of the initial image located before the marker pixel group in the m-th detection direction; otherwise, judging that the initial image has no invalid edge in the mth detection direction.
In an embodiment, the processing unit 502 is configured to, if the value is greater than the first threshold, determine that an invalid edge exists in the mth detection direction in the initial image, and specifically, configured to:
if the numerical value is larger than the first threshold value, judging whether the numerical value is larger than a second threshold value, wherein the second threshold value is larger than the first threshold value;
if the value is greater than or equal to the second threshold value, determining that no invalid edge exists in the mth detection direction of the initial image;
and if the numerical value is smaller than the second threshold value, judging that the initial image has an invalid edge in the mth detection direction.
In one embodiment, the target image is an image in a target video, the target video includes N frames of images, N is a positive integer; the processing unit 502 is configured to perform visual saliency prediction on the target image to obtain saliency information of the target image, and specifically configured to:
acquiring a significance prediction model, wherein the significance prediction model comprises a time flow network and a space flow network;
acquiring one or more frames of associated images of the target image, wherein each frame of associated image and the target image form a continuous image sequence;
calling the time flow network to carry out significance prediction on the target image according to the relevance between each frame of associated image and the target image to obtain a time sequence significance result of the target image;
calling the spatial stream network to carry out significance prediction on the target image to obtain a spatial significance result of the target image;
and fusing the time sequence significance result and the space significance result to obtain significance information of the target image.
In one embodiment, the significance prediction model further comprises a convolution gaussian layer, wherein the convolution gaussian layer is obtained based on a plurality of gaussian kernel training with different variance sizes; the processing unit 502 is configured to fuse the time-sequence saliency result and the spatial saliency result to obtain saliency information of the target image, and specifically configured to:
fusing the time sequence significance result and the space significance result to obtain a fusion result;
and calling the convolution Gaussian layer to carry out calibration processing on the fusion result to obtain the significance information of the target image.
According to an embodiment of the present application, some steps involved in the image cropping methods shown in fig. 2 and 3 may be performed by respective units in the image cropping device shown in fig. 5. For example, step S201 shown in fig. 2 may be performed by the acquisition unit 501 shown in fig. 5, and steps S202 to S204 may be performed by the processing unit 502 shown in fig. 5. Step S301 shown in fig. 3 may be performed by the acquisition unit 501 shown in fig. 5, and steps S302 to S309 may be performed by the processing unit 502 shown in fig. 5. The units in the image cropping device shown in fig. 5 may be respectively or entirely combined into one or several other units to form the image cropping device, or some unit(s) may be further split into multiple functionally smaller units to form the image cropping device, which may achieve the same operation without affecting the achievement of the technical effects of the embodiments of the present application. The units are divided based on logic functions, and in practical application, the functions of one unit can be realized by a plurality of units, or the functions of a plurality of units can be realized by one unit. In other embodiments of the present application, the image cropping device may also include other units, and in practical applications, these functions may also be implemented with the assistance of other units, and may be implemented by cooperation of a plurality of units.
According to another embodiment of the present application, the image cropping apparatus shown in fig. 5 may be constructed, and the image cropping method of the embodiments of the present application may be implemented, by running a computer program (including program code) capable of executing the steps involved in the methods shown in fig. 2 and fig. 3 on a general-purpose computing device, such as a computer comprising processing elements such as a Central Processing Unit (CPU) and storage elements such as a random access memory (RAM) and a read-only memory (ROM). The computer program may be recorded on, for example, a computer-readable recording medium, and loaded into and run in the above-described computing device via the computer-readable recording medium.
Based on the same inventive concept, the principle and advantageous effects of the image cropping apparatus provided in the embodiments of the present application in solving the problem are similar to those of the image cropping method in the method embodiments of the present application; for brevity, reference may be made to the principle and advantageous effects of the method implementation, which are not described herein again.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a computer device according to an embodiment of the present application, where the computer device at least includes a processor 601, a communication interface 602, and a memory 603. The processor 601, the communication interface 602, and the memory 603 may be connected by a bus or in other manners. The processor 601 (or Central Processing Unit, CPU) is the computing core and control core of the terminal, and can parse various instructions in the terminal and process various data of the terminal. For example, the CPU can parse a power-on/power-off instruction sent to the terminal by a user and control the terminal to perform the power-on/power-off operation; as another example, the CPU can transfer various types of interactive data between the internal structures of the terminal, and so on. The communication interface 602 may optionally include a standard wired interface or a wireless interface (e.g., WI-FI, a mobile communication interface, etc.), and may transmit and receive data under the control of the processor 601; the communication interface 602 can also be used for transmission and interaction of data inside the terminal. The memory 603 is a memory device in the terminal for storing programs and data. It is understood that the memory 603 herein may include the built-in memory of the terminal, and may also include extended memory supported by the terminal. The memory 603 provides storage space that stores the operating system of the terminal, which may include, but is not limited to: an Android system, an iOS system, a Windows Phone system, and the like, which are not limited in this application.
In the embodiment of the present application, the processor 601 is configured to execute the following operations by executing the executable program code in the memory 603:
acquiring a target image to be processed through a communication interface 602, and determining a cutting frame for cutting the target image;
performing visual saliency prediction on the target image to obtain saliency information of the target image, wherein the saliency information is used for indicating the distribution condition of a saliency area in the target image, and the saliency area is an area capable of attracting the attention of a user in the target image;
determining a target position of the cutting frame in the target image according to the significance information, wherein the target position refers to: when the attribute of the saliency region included in the crop box meets the attribute condition, the position of the crop box in the target image;
and cutting the target image by adopting the cutting frame at the target position to obtain a cut image.
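For illustration only, the following Python sketch shows one possible way these four operations could be chained together; the helper functions predict_saliency and find_target_position are hypothetical names (a possible find_target_position is sketched after the next embodiment), and the fixed crop size is an assumption.

```python
import numpy as np

def crop_by_saliency(target_image: np.ndarray, crop_w: int, crop_h: int) -> np.ndarray:
    """Minimal sketch of the overall flow; the helper functions are hypothetical."""
    # Visual saliency prediction: per-pixel probability map, shape (H, W), values in [0, 1].
    saliency_map = predict_saliency(target_image)
    # Determine the target position of the crop box according to the saliency information.
    x, y = find_target_position(saliency_map, crop_w, crop_h)
    # Crop the target image with the crop box at the target position.
    return target_image[y:y + crop_h, x:x + crop_w]
```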
As an alternative embodiment, the specific implementation manner of the processor 601 determining the target position of the crop box in the target image according to the saliency information is as follows:
determining a sliding direction of the cutting frame in the target image;
sliding the cutting frame in the target image according to the sliding direction to determine a plurality of candidate positions of the cutting frame in the target image, wherein each candidate position refers to a position of the cutting frame in the target image after each sliding;
calculating the attribute of a saliency region included by the cutting frame at each candidate position according to the saliency information;
and selecting a candidate position corresponding to the target attribute meeting the attribute condition from the plurality of candidate positions as the target position of the cutting frame in the target image.
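A minimal sketch of the candidate-position search described above, assuming the saliency information is a NumPy array and the attribute of each candidate is the total saliency enclosed by the crop box; the sliding_direction helper is sketched after the next embodiment, and the stride is an assumed parameter.

```python
import numpy as np

def find_target_position(saliency_map: np.ndarray, crop_w: int, crop_h: int, stride: int = 8):
    """Slide the crop box through candidate positions and keep the most salient placement."""
    H, W = saliency_map.shape
    if sliding_direction(W, H, crop_w, crop_h) == "horizontal":
        candidates = [(x, 0) for x in range(0, W - crop_w + 1, stride)]
    else:
        candidates = [(0, y) for y in range(0, H - crop_h + 1, stride)]
    # Attribute of a candidate: saliency enclosed by the crop box at that position.
    return max(candidates,
               key=lambda p: saliency_map[p[1]:p[1] + crop_h, p[0]:p[0] + crop_w].sum())
```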
As an alternative embodiment, the specific implementation manner of the processor 601 for determining the sliding direction of the crop box in the target image is as follows:
acquiring the width-height ratio of the target image and the width-height ratio of the cutting frame;
if the width-height ratio of the target image is larger than the width-height ratio of the cutting frame, determining the sliding direction of the cutting frame in the target image as a horizontal sliding direction;
and if the width-height ratio of the target image is smaller than the width-height ratio of the cutting frame, determining the sliding direction of the cutting frame in the target image as a vertical sliding direction.
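A sketch of the aspect-ratio comparison above; the equal-ratio case is not specified in the text and is treated here as vertical by assumption.

```python
def sliding_direction(img_w: int, img_h: int, crop_w: int, crop_h: int) -> str:
    """Compare the width-height ratio of the image with that of the crop box."""
    if img_w / img_h > crop_w / crop_h:
        return "horizontal"   # image is relatively wider than the crop box
    return "vertical"         # image is relatively taller (or the ratios are equal, by assumption)
```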
As an alternative embodiment, the saliency information includes saliency probability values of respective pixel points in the target image; the attribute of the saliency area included by the crop box at any candidate position includes: the saliency calculated from the saliency probability values of the pixel points enclosed by the crop box at that candidate position;
the attribute condition includes: the saliency being greater than a saliency threshold, or the saliency being the maximum.
As an alternative embodiment, the specific implementation manner of the processor 601 for calculating the attribute of the saliency area included in the crop box at each candidate position is as follows:
according to the sliding direction, projecting the significance probability value of each pixel point of the target image into the target image to obtain a projection curve;
for any candidate position, determining a curve segment included by the cutting frame at any candidate position from the projection curve; and integrating the curve segment to obtain the significance of the significance region included by the cutting frame at any candidate position.
As an optional embodiment, the target image includes P rows × Q columns of pixel points, where P and Q are both positive integers; the sliding direction comprises a horizontal sliding direction or a vertical sliding direction; the specific implementation manner in which the processor 601 projects the saliency probability value of each pixel point of the target image according to the sliding direction to obtain a projection curve is as follows:
if the sliding direction is the horizontal sliding direction, sequentially obtaining the sum of the saliency probability values of all pixel points in the q-th column of the target image as the projection point of the q-th column, wherein q belongs to [1, Q]; and drawing a curve from the projection points of the columns of the target image to obtain the projection curve;
if the sliding direction is the vertical sliding direction, sequentially obtaining the sum of the saliency probability values of all pixel points in the p-th row of the target image as the projection point of the p-th row, wherein p belongs to [1, P]; and drawing a curve from the projection points of the rows of the target image to obtain the projection curve.
As an alternative embodiment, after obtaining the cropped image, the processor 601 is further configured to:
calculating the significance of the target image according to the significance probability value of each pixel point in the target image;
scoring the completeness of the cropped image according to the saliency of the target image and the saliency of the cropped image, and outputting the scoring result;
wherein the saliency of the cropped image is equal to the saliency of the saliency area included by the crop box at the target position.
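The projection, integration and completeness-scoring steps above can be sketched as follows; the exact scoring formula is not given in the text, so the ratio of retained saliency used here is only an assumption.

```python
import numpy as np

def projection_curve(saliency_map: np.ndarray, direction: str) -> np.ndarray:
    """Project per-pixel saliency onto the sliding axis: per-column sums for horizontal
    sliding, per-row sums for vertical sliding."""
    return saliency_map.sum(axis=0 if direction == "horizontal" else 1)

def segment_saliency(curve: np.ndarray, start: int, length: int) -> float:
    """'Integrate' the curve segment covered by the crop box at one candidate position."""
    return float(curve[start:start + length].sum())

def completeness_score(saliency_map: np.ndarray, crop_saliency: float) -> float:
    """Assumed scoring rule: fraction of the image's total saliency kept by the crop."""
    total = float(saliency_map.sum())
    return crop_saliency / total if total > 0 else 0.0
```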
As an alternative embodiment, the target image is any frame image in the target video segment; the specific implementation manner of the processor 601, at the target position, performing clipping processing on the target image by using the clipping frame to obtain a clipped image is as follows:
determining target position coordinates of the target position and reference position coordinates of a reference position of the cutting frame in each reference image; the reference image refers to an image in the target video clip except the target image;
calibrating the target position coordinates according to the reference position coordinates to obtain calibrated position coordinates;
and at the position indicated by the calibrated position coordinate, cutting the target image by adopting the cutting frame to obtain a cut image.
As an alternative embodiment, the specific implementation manner of the processor 601 performing calibration processing on the target position coordinates according to each reference position coordinate to obtain calibrated position coordinates is as follows:
if the number of the images in the target video clip is smaller than a number threshold, performing mean value operation on each reference position coordinate and the target position coordinate to obtain a calibrated position coordinate;
and if the number of the images in the target video clip is larger than the number threshold, performing one-dimensional Gaussian smoothing on each reference position coordinate and the target position coordinate, and determining the calibrated position coordinate according to the smoothing result.
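A sketch of the calibration rule above for one coordinate of the crop box tracked across the frames of a clip; the count threshold and the Gaussian sigma are assumed values, and the group-based refinement of the next embodiment is sketched separately below it.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def calibrate_coordinates(coords, count_threshold: int = 10, sigma: float = 2.0) -> np.ndarray:
    """coords: one crop-box coordinate per frame of the target video clip."""
    coords = np.asarray(coords, dtype=float)
    if len(coords) < count_threshold:
        # Few frames: use the mean of the reference and target coordinates.
        return np.full_like(coords, coords.mean())
    # Otherwise: one-dimensional Gaussian smoothing along the time axis.
    return gaussian_filter1d(coords, sigma=sigma)
```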
As an alternative embodiment, the smoothing result includes: the smoothed coordinates of the reference positions and the smoothed coordinates of the target position coordinates are obtained; the specific implementation of the processor 601 determining the calibrated position coordinates according to the smoothing result is as follows:
detecting whether a target image group exists in the target video clip according to the smoothing processing result, wherein the target image group comprises the target image, and the coordinate difference value between the smooth coordinate of the cutting frame in any image in the target image group and the smooth coordinate of the cutting frame in the first frame image in the target image group is smaller than or equal to a difference threshold value;
if so, performing mean value operation on the smooth position coordinates of the cutting frame in each image in the target image group to obtain calibrated position coordinates;
and if no such target image group exists, taking the smoothed coordinate obtained by smoothing the target position coordinate as the calibrated position coordinate.
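One possible reading of the group-based calibration above, sketched for a single coordinate; the grouping rule is interpreted here as finding a run of consecutive frames containing the target frame whose smoothed coordinates all stay within the difference threshold of the run's first frame.

```python
import numpy as np

def calibrate_with_group(smoothed, target_idx: int, diff_threshold: float = 5.0) -> float:
    """smoothed: Gaussian-smoothed crop-box coordinates, one per frame of the clip."""
    smoothed = np.asarray(smoothed, dtype=float)
    n = len(smoothed)
    for start in range(target_idx + 1):            # candidate first frames of the group
        end = start
        while end + 1 < n and abs(smoothed[end + 1] - smoothed[start]) <= diff_threshold:
            end += 1
        if end >= target_idx and end > start:      # group exists and covers the target frame
            return float(smoothed[start:end + 1].mean())
    return float(smoothed[target_idx])             # no such group: keep the smoothed coordinate
```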
As an alternative embodiment, the specific implementation manner of the processor 601 obtaining the target image to be processed is as follows:
acquiring an initial image;
detecting an invalid edge of the initial image, wherein the invalid edge is an edge formed by filling pixels in the image;
if one or more invalid edges exist in the initial image, deleting the one or more invalid edges in the initial image to obtain the target image;
and if no invalid edge exists in the initial image, taking the initial image as the target image.
As an alternative embodiment, the specific implementation manner of the processor 601 performing invalid edge detection on the initial image is as follows:
determining M detection directions of the initial image, and acquiring a reference value corresponding to the m-th detection direction; wherein M is a positive integer, and m belongs to [1, M];
sequentially scanning each pixel group in the initial image according to the mth detection direction, wherein one pixel group consists of all pixel points which are positioned in the same row or the same column in the initial image;
counting the number of target pixel points in the currently scanned pixel group according to the reference value, wherein the target pixel points refer to: pixel points of which the difference value between the pixel value and the reference value is greater than a difference threshold value;
if the number of the target pixel points meets the number condition, continuing to scan and adding one to the invalid count corresponding to the mth detection direction; otherwise, determining the pixel group scanned currently as a marked pixel group, terminating scanning in the mth detection direction, and acquiring the numerical value of the invalid count when the scanning is terminated in the mth detection direction;
if the value is greater than a first threshold, determining that an invalid edge exists in the mth detection direction of the initial image, and the invalid edge includes: each pixel group of the initial image located before the marker pixel group in the m-th detection direction; otherwise, judging that the initial image has no invalid edge in the mth detection direction.
As an alternative embodiment, if the value is greater than the first threshold, the specific implementation manner of the processor 601 determining that the initial image has an invalid edge in the mth detection direction is as follows:
if the numerical value is larger than the first threshold value, judging whether the numerical value is larger than a second threshold value, wherein the second threshold value is larger than the first threshold value;
if the value is greater than or equal to the second threshold value, determining that no invalid edge exists in the mth detection direction of the initial image;
and if the numerical value is smaller than the second threshold value, judging that the initial image has an invalid edge in the mth detection direction.
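A sketch of the edge scan above for a single detection direction (scanning rows from the top of a grayscale image); the reference value, the "number condition" and the two count thresholds are assumed parameters, since the text does not fix their values.

```python
import numpy as np

def invalid_edge_rows(gray: np.ndarray, reference_value: int = 0, diff_threshold: int = 10,
                      max_differing: int = 5, first_threshold: int = 2,
                      second_threshold: int = 200) -> int:
    """Return how many rows at the top of the image are judged to form an invalid (filled) edge."""
    invalid_count = 0
    for row in gray:                                             # one pixel group per row
        # Target pixels: pixels whose value differs from the reference by more than the threshold.
        differing = int((np.abs(row.astype(int) - reference_value) > diff_threshold).sum())
        if differing <= max_differing:                           # assumed 'number condition'
            invalid_count += 1                                   # keep scanning
        else:
            break                                                # marked pixel group: stop scanning
    # Very thin borders are ignored; implausibly thick ones are treated as real content.
    if first_threshold < invalid_count < second_threshold:
        return invalid_count                                     # rows to delete from the top
    return 0
```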
As an alternative embodiment, the target image is an image in a target video, the target video includes N frames of images, and N is a positive integer; the processor 601 performs visual saliency prediction on the target image, and the specific implementation manner of obtaining the saliency information of the target image is as follows:
acquiring a significance prediction model, wherein the significance prediction model comprises a time flow network and a space flow network;
acquiring one or more frames of associated images of the target image, wherein each frame of associated image and the target image form a continuous image sequence;
calling the time flow network to carry out significance prediction on the target image according to the relevance between each frame of associated image and the target image to obtain a time sequence significance result of the target image;
calling the spatial stream network to carry out significance prediction on the target image to obtain a spatial significance result of the target image;
and fusing the time sequence significance result and the space significance result to obtain significance information of the target image.
As an alternative embodiment, the significance prediction model further includes a convolution gaussian layer, and the convolution gaussian layer is obtained based on a plurality of gaussian kernels with different variance sizes; the specific implementation manner of the processor 601 fusing the time sequence saliency result and the spatial saliency result to obtain the saliency information of the target image is as follows:
fusing the time sequence significance result and the space significance result to obtain a fusion result;
and calling the convolution Gaussian layer to carry out calibration processing on the fusion result to obtain the significance information of the target image.
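For illustration, a PyTorch-style sketch of the fusion and Gaussian-calibration stage described above; the temporal-stream and spatial-stream backbones are assumed to produce single-channel saliency maps elsewhere, and initialising the convolutional Gaussian layer from fixed Gaussian kernels of different variances is only an assumption about how such a layer might be built.

```python
import torch
import torch.nn as nn

class SaliencyFusion(nn.Module):
    """Fuse temporal and spatial saliency maps, then calibrate with a Gaussian-initialised layer."""
    def __init__(self, kernel_size: int = 15, sigmas=(1.0, 2.0, 4.0, 8.0)):
        super().__init__()
        self.fuse = nn.Conv2d(2, 1, kernel_size=1)   # learned 1x1 fusion of the two streams
        self.gauss = nn.Conv2d(1, len(sigmas), kernel_size,
                               padding=kernel_size // 2, bias=False)
        with torch.no_grad():
            coords = torch.arange(kernel_size, dtype=torch.float32) - kernel_size // 2
            for i, s in enumerate(sigmas):           # one Gaussian kernel per assumed variance
                g = torch.exp(-(coords[:, None] ** 2 + coords[None, :] ** 2) / (2 * s * s))
                self.gauss.weight[i, 0] = g / g.sum()
        self.mix = nn.Conv2d(len(sigmas), 1, kernel_size=1)  # combine the calibrated maps

    def forward(self, temporal_map: torch.Tensor, spatial_map: torch.Tensor) -> torch.Tensor:
        fused = torch.sigmoid(self.fuse(torch.cat([temporal_map, spatial_map], dim=1)))
        return torch.sigmoid(self.mix(self.gauss(fused)))
```

In a structure like the one in fig. 5, such a module would sit behind the time-flow and space-flow networks and emit the saliency map used by the downstream cropping steps.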
Based on the same inventive concept, the principle and the beneficial effects of the computer device provided in the embodiments of the present application in solving the problem are similar to those of the image cropping method in the embodiments of the present application; for brevity, reference may be made to the principle and beneficial effects of the implementation of the method, which are not described herein again.
The embodiment of the present application further provides a computer-readable storage medium, where one or more instructions are stored in the computer-readable storage medium, and the one or more instructions are adapted to be loaded by a processor and execute the image cropping method according to the above method embodiment.
Embodiments of the present application further provide a computer program product containing instructions, which when run on a computer, cause the computer to execute the image cropping method according to the above method embodiments.
Embodiments of the present application also provide a computer program product or a computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to cause the computer device to execute the method for image cropping.
The steps in the method of the embodiment of the application can be sequentially adjusted, combined and deleted according to actual needs.
The modules in the device can be merged, divided and deleted according to actual needs.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing the relevant hardware, and the program may be stored in a computer-readable storage medium, which may include: a flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and the like.
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (18)

1. An image cropping method, characterized in that it comprises:
acquiring a target image to be processed, and determining a cutting frame for cutting the target image;
performing visual saliency prediction on the target image to obtain saliency information of the target image, wherein the saliency information is used for indicating the distribution condition of a saliency area in the target image, and the saliency area is an area capable of attracting the attention of a user in the target image;
determining a target position of the cutting frame in the target image according to the significance information, wherein the target position refers to: when the attribute of the saliency region included in the crop box meets the attribute condition, the position of the crop box in the target image;
and cutting the target image by adopting the cutting frame at the target position to obtain a cut image.
2. The method of claim 1, wherein said determining a target location of the crop box in the target image based on the saliency information comprises:
determining a sliding direction of the cutting frame in the target image;
sliding the cutting frame in the target image according to the sliding direction to determine a plurality of candidate positions of the cutting frame in the target image, wherein each candidate position refers to a position of the cutting frame in the target image after each sliding;
calculating the attribute of a saliency region included by the cutting frame at each candidate position according to the saliency information;
and selecting a candidate position corresponding to the target attribute meeting the attribute condition from the plurality of candidate positions as the target position of the cutting frame in the target image.
3. The method of claim 2, wherein said determining a sliding direction of the crop box in the target image comprises:
acquiring the width-height ratio of the target image and the width-height ratio of the cutting frame;
if the width-height ratio of the target image is larger than the width-height ratio of the cutting frame, determining the sliding direction of the cutting frame in the target image as a horizontal sliding direction;
and if the width-height ratio of the target image is smaller than the width-height ratio of the cutting frame, determining the sliding direction of the cutting frame in the target image as a vertical sliding direction.
4. The method of claim 2, wherein the saliency information comprises a saliency probability value for each pixel point in the target image; the attribute of the saliency area included by the crop box at any candidate position includes: the saliency calculated from the saliency probability values of the pixel points enclosed by the crop box at that candidate position;
the attribute condition includes: the saliency being greater than a saliency threshold, or the saliency being the maximum.
5. The method of claim 4, wherein said calculating attributes of salient regions included by the crop box at respective candidate locations comprises:
according to the sliding direction, projecting the significance probability value of each pixel point of the target image into the target image to obtain a projection curve;
for any candidate position, determining a curve segment included by the cutting frame at any candidate position from the projection curve; and integrating the curve segment to obtain the significance of the significance region included by the cutting frame at any candidate position.
6. The method of claim 5, wherein the target image comprises P rows by Q columns of pixel points, and the values of P and Q are both positive integers; the sliding direction comprises a horizontal sliding direction or a vertical sliding direction; according to the sliding direction, projecting the significance probability value corresponding to each pixel point of the target image into the target image to obtain a projection curve, including:
if the sliding direction is the horizontal sliding direction, sequentially obtaining the sum of the saliency probability values of all pixel points in the q-th column of the target image as the projection point of the q-th column, wherein q belongs to [1, Q]; and drawing a curve from the projection points of the columns of the target image to obtain the projection curve;
if the sliding direction is the vertical sliding direction, sequentially obtaining the sum of the saliency probability values of all pixel points in the p-th row of the target image as the projection point of the p-th row, wherein p belongs to [1, P]; and drawing a curve from the projection points of the rows of the target image to obtain the projection curve.
7. The method of claim 4, wherein after obtaining the cropped image, the method further comprises:
calculating the significance of the target image according to the significance probability value of each pixel point in the target image;
according to the significance of the target image and the significance of the cut image, carrying out integrity scoring on the cut image and outputting a scoring result;
wherein the saliency of the cropped image is equal to the saliency of the saliency area included by the crop box at the target position.
8. The method of claim 1, wherein the target image is any frame image in a target video segment; the step of cutting the target image at the target position by using the cutting frame to obtain a cut image comprises the following steps:
determining target position coordinates of the target position and reference position coordinates of a reference position of the cutting frame in each reference image; the reference image refers to an image in the target video clip except the target image;
calibrating the target position coordinates according to the reference position coordinates to obtain calibrated position coordinates;
and at the position indicated by the calibrated position coordinate, cutting the target image by adopting the cutting frame to obtain a cut image.
9. The method of claim 8, wherein said calibrating said target location coordinates based on respective reference location coordinates to obtain calibrated location coordinates comprises:
if the number of the images in the target video clip is smaller than a number threshold, performing mean value operation on each reference position coordinate and the target position coordinate to obtain a calibrated position coordinate;
and if the number of the images in the target video clip is larger than the number threshold, performing one-dimensional Gaussian smoothing on each reference position coordinate and the target position coordinate, and determining the calibrated position coordinate according to the smoothing result.
10. The method of claim 9, wherein the smoothing results comprise: the smoothed coordinates of the reference positions and the smoothed coordinates of the target position coordinates are obtained; the determining the calibrated position coordinates according to the smoothing processing result includes:
detecting whether a target image group exists in the target video clip according to the smoothing processing result, wherein the target image group comprises the target image, and the coordinate difference value between the smooth coordinate of the cutting frame in any image in the target image group and the smooth coordinate of the cutting frame in the first frame image in the target image group is smaller than or equal to a difference threshold value;
if so, performing mean value operation on the smooth position coordinates of the cutting frame in each image in the target image group to obtain calibrated position coordinates;
and if no such target image group exists, taking the smoothed coordinate obtained by smoothing the target position coordinate as the calibrated position coordinate.
11. The method of claim 1, wherein the acquiring the target image to be processed comprises:
acquiring an initial image;
detecting an invalid edge of the initial image, wherein the invalid edge is an edge formed by filling pixels in the image;
if one or more invalid edges exist in the initial image, deleting the one or more invalid edges in the initial image to obtain the target image;
and if no invalid edge exists in the initial image, taking the initial image as the target image.
12. The method of claim 11, wherein the performing invalid edge detection on the initial image comprises:
determining M detection directions of the initial image, and acquiring a reference value corresponding to the m-th detection direction; wherein M is a positive integer, and m belongs to [1, M];
sequentially scanning each pixel group in the initial image according to the mth detection direction, wherein one pixel group consists of all pixel points which are positioned in the same row or the same column in the initial image;
counting the number of target pixel points in the currently scanned pixel group according to the reference value, wherein the target pixel points refer to: pixel points of which the difference value between the pixel value and the reference value is greater than a difference threshold value;
if the number of the target pixel points meets the number condition, continuing to scan and adding one to the invalid count corresponding to the mth detection direction; otherwise, determining the pixel group scanned currently as a marked pixel group, terminating scanning in the mth detection direction, and acquiring the numerical value of the invalid count when the scanning is terminated in the mth detection direction;
if the value is greater than a first threshold, determining that an invalid edge exists in the mth detection direction of the initial image, and the invalid edge includes: each pixel group of the initial image located before the marker pixel group in the m-th detection direction; otherwise, judging that the initial image has no invalid edge in the mth detection direction.
13. The method of claim 12, wherein determining that the initial image has an invalid edge in the mth detection direction if the value is greater than the first threshold comprises:
if the numerical value is larger than the first threshold value, judging whether the numerical value is larger than a second threshold value, wherein the second threshold value is larger than the first threshold value;
if the value is greater than or equal to the second threshold value, determining that no invalid edge exists in the mth detection direction of the initial image;
and if the numerical value is smaller than the second threshold value, judging that the initial image has an invalid edge in the mth detection direction.
14. The method of any one of claims 1-13, wherein the target image is an image in a target video, the target video comprising N frames of images, N being a positive integer; the predicting the visual saliency of the target image to obtain the saliency information of the target image comprises the following steps:
acquiring a significance prediction model, wherein the significance prediction model comprises a time flow network and a space flow network;
acquiring one or more frames of associated images of the target image, wherein each frame of associated image and the target image form a continuous image sequence;
calling the time flow network to carry out significance prediction on the target image according to the relevance between each frame of associated image and the target image to obtain a time sequence significance result of the target image;
calling the spatial stream network to carry out significance prediction on the target image to obtain a spatial significance result of the target image;
and fusing the time sequence significance result and the space significance result to obtain significance information of the target image.
15. The method of claim 14, wherein the significance prediction model further comprises a convolutional gaussian layer trained based on a plurality of gaussian kernels of different variance sizes; the fusion of the time sequence significance result and the space significance result to obtain the significance information of the target image comprises the following steps:
fusing the time sequence significance result and the space significance result to obtain a fusion result;
and calling the convolution Gaussian layer to carry out calibration processing on the fusion result to obtain the significance information of the target image.
16. An image cropping device, characterized in that the image cropping device comprises:
the device comprises an acquisition unit, a processing unit and a display unit, wherein the acquisition unit is used for acquiring a target image to be processed and determining a cutting frame for cutting the target image;
the processing unit is used for carrying out visual saliency prediction on the target image to obtain saliency information of the target image, wherein the saliency information is used for indicating the distribution situation of a saliency area in the target image, and the saliency area refers to an area which can attract the attention of a user in the target image;
the processing unit is further configured to determine, according to the saliency information, a target position of the crop box in the target image, where the target position is: when the attribute of the saliency region included in the crop box meets the attribute condition, the position of the crop box in the target image;
the processing unit is further configured to perform cropping processing on the target image at the target position by using the cropping frame, so as to obtain a cropped image.
17. A computer device, comprising: a storage device and a processor;
the storage device stores a computer program therein;
a processor executing a computer program implementing the image cropping method of any of claims 1-15.
18. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, causes the image cropping method according to any one of claims 1 to 15 to be carried out.
CN202011644040.1A 2020-12-30 2020-12-30 Image cropping method and device, computer equipment and storage medium Pending CN113516666A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011644040.1A CN113516666A (en) 2020-12-30 2020-12-30 Image cropping method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011644040.1A CN113516666A (en) 2020-12-30 2020-12-30 Image cropping method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113516666A true CN113516666A (en) 2021-10-19

Family

ID=78060896

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011644040.1A Pending CN113516666A (en) 2020-12-30 2020-12-30 Image cropping method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113516666A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114222065A (en) * 2021-12-20 2022-03-22 北京奕斯伟计算技术有限公司 Image processing method, image processing apparatus, electronic device, storage medium, and program product
CN114222065B (en) * 2021-12-20 2024-03-08 北京奕斯伟计算技术股份有限公司 Image processing method, image processing apparatus, electronic device, storage medium, and program product
CN114302226A (en) * 2021-12-28 2022-04-08 北京中科大洋信息技术有限公司 Intelligent cutting method for video picture
CN114520873A (en) * 2021-12-31 2022-05-20 广州文远知行科技有限公司 Sweeper video acceleration method and system
CN115082673A (en) * 2022-06-14 2022-09-20 阿里巴巴(中国)有限公司 Image processing method, device, equipment and storage medium
CN116546333A (en) * 2023-04-03 2023-08-04 华光影像科技合肥有限公司 Method, system and camera for simultaneously outputting video pictures of different shooting modes
CN116546333B (en) * 2023-04-03 2023-10-31 华光影像科技合肥有限公司 Method, system and camera for simultaneously outputting video pictures of different shooting modes

Similar Documents

Publication Publication Date Title
CN113516666A (en) Image cropping method and device, computer equipment and storage medium
CN110163640B (en) Method for implanting advertisement in video and computer equipment
CN110163198B (en) Table identification reconstruction method and device and storage medium
CN112215171B (en) Target detection method, device, equipment and computer readable storage medium
CN111556336B (en) Multimedia file processing method, device, terminal equipment and medium
CN110363753B (en) Image quality evaluation method and device and electronic equipment
CN111739027B (en) Image processing method, device, equipment and readable storage medium
CN111553362A (en) Video processing method, electronic equipment and computer readable storage medium
CN113204659B (en) Label classification method and device for multimedia resources, electronic equipment and storage medium
CN112541867A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN112101344B (en) Video text tracking method and device
CN112954450A (en) Video processing method and device, electronic equipment and storage medium
CN111881755A (en) Method and device for cutting video frame sequence
CN111432206A (en) Video definition processing method and device based on artificial intelligence and electronic equipment
US11468571B2 (en) Apparatus and method for generating image
CN115496820A (en) Method and device for generating image and file and computer storage medium
CN114003160B (en) Data visual display method, device, computer equipment and storage medium
CN111062930A (en) Image selection method and device, storage medium and computer equipment
CN113411550B (en) Video coloring method, device, equipment and storage medium
CN113570615A (en) Image processing method based on deep learning, electronic equipment and storage medium
CN112218005A (en) Video editing method based on artificial intelligence
US20210224571A1 (en) Automated Cropping of Images Using a Machine Learning Predictor
CN117459662A (en) Video playing method, video identifying method, video playing device, video playing equipment and storage medium
CN114255493A (en) Image detection method, face detection device, face detection equipment and storage medium
CN111741329A (en) Video processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination