CN113724153A - Method for eliminating redundant images based on machine learning - Google Patents

Method for eliminating redundant images based on machine learning

Info

Publication number
CN113724153A
CN113724153A (application CN202110880980.9A)
Authority
CN
China
Prior art keywords
image
redundant
frame
characters
network
Prior art date
Legal status
Pending
Application number
CN202110880980.9A
Other languages
Chinese (zh)
Inventor
周啸宇
石爻
倪志彬
梁淇奥
蒋新科
向芝莹
李顺
何震宇
左健甫
杨若辰
吴世涵
张恩华
吉雪莲
常世晴
罗佳源
陈攀宇
王瑞锦
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202110880980.9A priority Critical patent/CN113724153A/en
Publication of CN113724153A publication Critical patent/CN113724153A/en
Pending legal-status Critical Current

Classifications

    • G06T 5/77: Image enhancement or restoration; retouching; inpainting; scratch removal
    • G06N 3/045: Neural network architectures; combinations of networks
    • G06N 3/084: Neural network learning methods; backpropagation, e.g. using gradient descent
    • G06T 3/40: Geometric image transformations; scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T 3/4038: Image mosaicing, e.g. composing plane images from plane sub-images
    • G06T 7/10: Image analysis; segmentation; edge detection
    • G06T 2207/20081: Training; learning
    • G06T 2207/20084: Artificial neural networks [ANN]
    • G06T 2207/20132: Image segmentation details; image cropping
    • G06T 2207/30204: Marker

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a method for eliminating redundant figures from images based on machine learning, which comprises the following steps. Step one: acquire an original image containing a plurality of redundant persons, and detect and extract the redundant persons in the original image. Step two: label the persons detected and extracted from the image, crop out the labeled redundant persons, and retain the target-person image. Step three: use an image cropping-and-inpainting model to repair and complete the cropped image, obtaining an image with the redundant persons removed. Step one specifically comprises: when redundant persons and small targets are present in the acquired original images, the original images are fed into a target detection and extraction model for object detection, and the detected redundant persons and small targets are segmented. The invention automatically removes passers-by: a high-precision object detection algorithm extracts the figures in the image for cropping, and an advanced image inpainting technique keeps the result as close to reality as possible.

Description

Method for eliminating redundant images based on machine learning
Technical Field
The invention relates to the technical field of image processing, and in particular to a method for eliminating redundant figures from images based on machine learning.
Background
In everyday photography, passers-by often intrude into the frame, or too many people spoil the composition and make the picture unattractive. At the same time, photography inevitably touches on the portrait rights, privacy rights and reputation rights of the people photographed; the internet era has widened the reach of media content, and a citizen's own shooting or sharing behavior may well infringe regulations such as portrait rights and digital content copyright protection.
At present, the most common way to remove redundant persons from a photo is manual editing with tools such as Adobe Photoshop, but this depends on manual work and cannot be automated. The Huawei Mate 40 Pro ships with a built-in person-removal function, but the processed result suffers from reduced brightness. Meitu XiuXiu also offers a passer-by removal function; it performs reasonably well for small, distant figures, but performs worse when the redundant person occupies a large area of the picture, in which case the result is distorted and unattractive.
For example, patent application No. CN202010418692.7 discloses an AI-based auxiliary photographing terminal and photographing processing method comprising a photographing module, a detection module, a selection module and an elimination module, where the photographing module is connected to the detection module, the selection module to the detection module, and the elimination module to the selection module. The detection module acquires at least one first image and at least one second image from the images to be processed and detects them to obtain the relative relation between the person and the objects in the background. Although this scheme can process the first and second images so that the user promptly obtains the expected picture when shooting, the repair is applied to the otherwise unprocessed image, and the image-processing effect is poor.
Patent application No. CN201910204840.2 discloses a photographing processing method, apparatus, mobile terminal and storage medium. The method collects a preview image, feeds it to a trained image classification model and reads the model output; when the output indicates that the preview image contains an occluding object, the preview image is fed to a trained image generation model, which outputs a target image, i.e. the preview image repaired so that the occluding object is removed. Although this scheme classifies the preview image and generates the target image with an image classification model and an image generation module, the inpainting stage does not use an adversarial generative network to produce fine-grained edges and interior content, so the repair quality still needs improvement.
Disclosure of Invention
The aim of the invention is to overcome the shortcomings of the prior art by providing a machine-learning-based method for eliminating redundant figures from images. The method can effectively detect, identify and remove redundant persons in an image, addressing the problems of unattractive composition and possible infringement caused by too many redundant persons in a shot, and is therefore significant both for beautifying images and for preventing infringement.
The purpose of the invention is achieved by the following technical scheme:
A method for eliminating redundant figures from images based on machine learning comprises the following steps:
Step one: acquire an original image containing a plurality of redundant persons, and detect and extract the redundant persons in the original image;
Step two: label the persons detected and extracted from the image, crop out the labeled redundant persons, and retain the target-person image;
Step three: use an image cropping-and-inpainting model to repair and complete the cropped image, obtaining an image with the redundant persons removed.
Specifically, step one comprises: when the acquired original images are found to contain redundant persons and small targets, the original images are fed into a target detection and extraction model, object detection is performed on the redundant persons and small targets in the original images, and the detected redundant persons and small targets are segmented.
Specifically, object detection of the redundant persons and small targets in the original image comprises the following substeps:
S11, Mosaic data enhancement: four pictures are randomly read from the original images each time, each of the four is flipped, scaled and color-gamut-shifted, the four processed pictures are stitched together, and finally the pictures and their bounding boxes are merged;
S12, adaptive anchor-box calculation: on the basis of predefined boxes, training samples are built from the offset of the ground-truth box position relative to the preset box during training of the target detection and extraction model; in the gesture recognition network computation, the network outputs predicted boxes on top of the initial anchor boxes, compares them with the ground-truth boxes, computes the gap between the two, and then updates the network parameters iteratively by backpropagation;
S13, adaptive image scaling for multi-scale training: the scale ratio, the scaled standard size and the border-padding value are computed; the original picture is uniformly scaled to the preset standard size according to the scale ratio, adaptive gray-border padding is applied according to the padding value, and the padded picture is fed into the FPN + PAN pyramid structure for accurate person recognition.
Specifically, step two comprises: the persons and gesture small targets detected and extracted in the image are labeled separately, and the labeled redundant persons and gestures are cropped using DropBlock to obtain the cropped target-person image.
Specifically, step three comprises: an edge-connected adversarial generative network is used to perform fine-grained edge generation and interior inpainting of the image; edge generation and image completion are carried out by a two-stage network framework, where the first stage repairs the edges and interior texture of the image and the second stage completes color filling and image repair over the whole line framework.
Specifically, the first-stage process of repairing the edges and interior texture of the image comprises: building an image line restorer module consisting of an encoder with two downsampling steps, eight residual blocks and a deconvolution decoder, where the residual blocks replace regular convolution with dilated convolution with a dilation factor of 2; the image line restorer module performs texture restoration on the cropped target-person image.
Specifically, the second-stage process of completing color filling and image repair of the whole line framework comprises: building a pixel-level generative network architecture whose inputs are the RGB channels with the missing edge, a mask channel and a texture-line channel, treating the pixels of each channel as valid input and providing a dynamic feature-selection mechanism for every position in every layer; reconstructing the image from the edges of the incomplete image repaired in the first stage together with the background image; and building an image restorer with a joint loss of perceptual loss and style loss to achieve adversarial fine-grained restoration and completion of the incomplete image.
The invention has the following beneficial effects: it automatically removes passers-by, uses a high-precision object detection algorithm to extract and crop the figures in the image, and applies an advanced image inpainting technique so that the result stays as close to reality as possible. Experimental results on a prepared data set show that the quantitative indices IS (Inception Score) and FID outperform those of existing schemes, the degree of image distortion is low and the picture is more attractive.
Drawings
Fig. 1 is a flow chart of the steps of the method of the present invention.
Fig. 2 is a general flow diagram of the method of the present invention.
Fig. 3 is a schematic diagram of the operation of the generator network of the present invention.
Fig. 4 is a schematic diagram of the operation of the discriminator network of the present invention.
Detailed Description
In order to make the technical features, objects and effects of the present invention clearly understood, embodiments of the present invention are described in detail below with reference to the technical solutions and the accompanying drawings. Only some embodiments, not all of them, are described; other embodiments obtained through inventive modification by persons skilled in the art fall within the protection scope of the present invention.
Embodiment one:
In this embodiment, as shown in fig. 1 and fig. 2, a method for eliminating redundant figures from images based on machine learning comprises the following steps:
Step one: acquire an original image containing a plurality of redundant persons, and detect and extract the redundant persons in the original image;
Step two: label the persons detected and extracted from the image, crop out the labeled redundant persons, and retain the target-person image;
Step three: use an image cropping-and-inpainting model to repair and complete the cropped image, obtaining an image with the redundant persons removed.
The invention develops an end-to-end image cropping-and-inpainting model that combines the YOLO image segmentation algorithm with generative adversarial image inpainting to automatically remove passers-by from a picture and repair it naturally. The processing pipeline has two parts: small-target extraction and detection based on YOLOv5, and fine-grained edge generation and image inpainting based on an adversarial network.
The first part is object extraction and detection, which mainly applies the YOLOv5 algorithm; this part performs object detection on the redundant persons in the image.
The second part is fine-grained edge generation and image inpainting based on the adversarial network; it repairs the cropped image so that the picture is restored as close to the real image as possible.
The GAN used here consists mainly of two networks: a generator network and a discriminator network.
The generator takes a set of vectors as input and produces a set of target matrices; its goal is to make its sample-producing ability as strong as possible, and its operation flow is shown in fig. 3. The GAN also has a discriminator, whose purpose is to judge whether a given picture comes from the set of real samples or from the set of generated (fake) samples: the network should output a value close to 1 for a real sample and close to 0 for a fake sample, which is what good discrimination means. The operation flow of the discriminator network is shown in fig. 4.
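By way of illustration only, the following Python (PyTorch-style) sketch shows the generator/discriminator pairing described above; the layer sizes, module names and latent dimension are assumptions for the example, not the architecture of the invention.

```python
# Minimal sketch of a GAN generator and discriminator (illustrative sizes, not the patent's architecture).
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, latent_dim=100, out_pixels=64 * 64 * 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, out_pixels), nn.Tanh(),   # fake sample in [-1, 1]
        )

    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):
    def __init__(self, in_pixels=64 * 64 * 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_pixels, 512), nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(512, 1), nn.Sigmoid(),          # output close to 1 for real, close to 0 for fake
        )

    def forward(self, x):
        return self.net(x)
```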
The invention designs and develops an end-to-end image cropping-and-inpainting model that combines the YOLO image segmentation algorithm with generative adversarial inpainting to automatically remove and naturally repair passers-by in the image. The model mainly realizes the following functions: 1) Mosaic data enhancement: strengthens small-target detection and network robustness; 2) CIoU_Loss: measures aspect-ratio similarity so that anchor boxes converge onto the target better and faster; 3) accurate localization and recognition based on FPN + PAN: improves gesture recognition and localization performance; 4) DropBlock: better generalization and higher recognition accuracy; 5) a generative background restoration module: it takes the intermediate image produced by target segmentation as input and, through the two stages of a texture restorer and an image restorer, outputs a natural scene image with the target person removed.
Mosaic data enhancement. When training on the collected data set, it is found that the data set is often small and that small targets account for most of it. Roughly 3/5 of the actions in the camera-collected data set involve a single hand; the hand appears as a small detected object, and its movement makes the distribution of small targets uneven. An uneven distribution of small targets makes the network rather unstable, so Mosaic data enhancement is adopted.
Mosaic data enhancement is inspired by CutMix, which uses only two pictures and mixes them by pasting a cropped region of one onto the other. Mosaic instead randomly reads four pictures from the data set each time, flips, scales and color-gamut-shifts each of them, then stitches the four pictures together and finally merges the pictures and their bounding boxes. This strengthens small-target detection and network robustness to a certain extent. The advantages of Mosaic do not end there: because the data of four pictures are processed in one computation during training, the mini-batch size does not need to be large, and a single GPU can already achieve good results.
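The following sketch illustrates the Mosaic idea under stated assumptions (640-pixel output, gray padding value 114, simple hue jitter); bounding-box merging is omitted, and the helper is hypothetical rather than the training code of the invention.

```python
# Sketch of Mosaic augmentation: four randomly chosen images are individually flipped,
# rescaled and color-jittered, then tiled into one training picture.
import random
import numpy as np
import cv2

def mosaic(images, out_size=640):
    """images: list of at least four BGR uint8 arrays."""
    half = out_size // 2
    canvas = np.full((out_size, out_size, 3), 114, dtype=np.uint8)
    corners = [(0, 0), (0, half), (half, 0), (half, half)]
    for (y, x), img in zip(corners, random.sample(images, 4)):
        if random.random() < 0.5:                       # random horizontal flip
            img = cv2.flip(img, 1)
        img = cv2.resize(img, (half, half))             # rescale to the tile size
        hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV).astype(np.int16)
        hsv[..., 0] = (hsv[..., 0] + random.randint(-10, 10)) % 180   # color-gamut (hue) jitter
        img = cv2.cvtColor(np.clip(hsv, 0, 255).astype(np.uint8), cv2.COLOR_HSV2BGR)
        canvas[y:y + half, x:x + half] = img            # stitch the four tiles together
    return canvas
```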
Adaptive anchor-box calculation. The adaptive anchor box is based on predefined boxes: during network training, training samples are built from the offset of the ground-truth box position relative to the preset box. In the gesture recognition network computation, the network outputs predicted boxes on top of the initial anchor boxes, compares them with the ground-truth boxes, computes the gap between the two, and then updates the network parameters iteratively by backpropagation.
Adaptive anchor boxes are used in both the training phase and the prediction phase. In the training phase the anchor boxes serve as training samples, and each anchor box must be given two kinds of labels: the class of the target it contains, and the offset of the ground-truth bounding box relative to it. During gesture recognition and detection, many anchor boxes are first generated, the class and offset of each anchor box are predicted, the anchor positions are then adjusted according to the predicted offsets to obtain predicted bounding boxes, and finally the predicted boxes to be output are filtered.
In the prediction phase, many anchor boxes are first generated on the frame captured from the real-time video stream, and the class and offset are then predicted with the trained model to obtain the predicted bounding boxes. Adaptive anchor-box calculation greatly improves the stability of the network output.
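A toy illustration of the anchor-offset idea follows; the (cx, cy, w, h) box format and the log-space encoding are common simplifications assumed for the example, not the exact scheme used by the invention.

```python
# The training target is the shift of the ground-truth box relative to a preset anchor;
# prediction reverses that shift to recover a box from a predicted offset.
import numpy as np

def encode(gt, anchor):
    """Offset of the real box relative to the anchor (used as the training label)."""
    return np.array([
        (gt[0] - anchor[0]) / anchor[2],
        (gt[1] - anchor[1]) / anchor[3],
        np.log(gt[2] / anchor[2]),
        np.log(gt[3] / anchor[3]),
    ])

def decode(offset, anchor):
    """Apply a predicted offset to the anchor to obtain the predicted bounding box."""
    return np.array([
        anchor[0] + offset[0] * anchor[2],
        anchor[1] + offset[1] * anchor[3],
        anchor[2] * np.exp(offset[2]),
        anchor[3] * np.exp(offset[3]),
    ])
```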
Adaptive image scaling for multi-scale training. The pictures in the collected data set do not all have the same dimensions, so the usual approach is to scale the originals uniformly to a standard size before sending them to the detection network.
However, uniform scaling to the standard size requires padding; for example, when all images are scaled to 416 x 416, the black borders added at the two ends may differ in size, or there may be a lot of padding, and the redundant information slows down inference.
Adaptive image scaling for multi-scale training instead fills the original image with adaptive gray borders. In this way less border is padded and the amount of computation during network inference decreases, i.e. the target detection speed increases. The specific steps are: 1. compute the scale ratio; 2. compute the scaled size; 3. compute the border-padding value. The model does not shrink the borders of the gesture photos during training; this step is only used at test time and during model inference, which improves gesture detection speed and model accuracy to a certain extent.
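The three steps above can be sketched as follows; the 416 target size and the gray value 114 follow common YOLOv5 practice and are assumptions of this example.

```python
# Sketch of the adaptive letterbox step: compute the scale ratio, the scaled size and the
# padding, then fill the remainder with gray instead of large black borders.
import cv2

def letterbox(img, new_size=416, pad_value=114):
    h, w = img.shape[:2]
    r = min(new_size / h, new_size / w)                 # 1. scale ratio
    new_h, new_w = round(h * r), round(w * r)           # 2. scaled size
    pad_h, pad_w = new_size - new_h, new_size - new_w   # 3. border to fill
    img = cv2.resize(img, (new_w, new_h))
    top, bottom = pad_h // 2, pad_h - pad_h // 2
    left, right = pad_w // 2, pad_w - pad_w // 2
    return cv2.copyMakeBorder(img, top, bottom, left, right,
                              cv2.BORDER_CONSTANT, value=(pad_value,) * 3)
```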
Embodiment two:
In this embodiment, building on the method for eliminating redundant figures from an image provided in embodiment one, CIoU_Loss is introduced as the anchor-box matching loss function that takes special cases into account. When localizing an anchor box, the intersection-over-union (IoU) of the predicted box and the ground-truth box and the corresponding loss are computed. This has two problems: (1) IoU_Loss cannot be optimized when the predicted box and the ground-truth box do not intersect, because the loss is then 1 and its derivative is 0, so the gradient vanishes; (2) when the intersection-over-union is the same, IoU_Loss cannot distinguish the different ways in which the predicted box overlaps the ground-truth box, so the direction in which the predicted anchor box should move cannot be determined.
To solve these two problems, GIoU_Loss is introduced: the difference region between the two anchor boxes is added, which amounts to adding a penalty term:
GIoU_Loss = 1 - IoU + |C \ (A ∪ B)| / |C|,
where A and B are the predicted and ground-truth boxes and C is their smallest enclosing box. A problem remains: when the predicted anchor box lies inside the ground-truth box, the moving direction still cannot be determined, because moving the box around inside does not change the size of the difference region. A constraint on the positions of the two box centers is therefore added; the centers of the two anchor boxes are first made to coincide, and the width and height are then adjusted:
DIoU_Loss = 1 - IoU + ρ²(b, b_gt) / c²,
where c is the length of the diagonal of the smallest enclosing box and the numerator is the squared distance between the two center points. Adding a term α·v on top of DIoU measures the similarity of the aspect ratios and makes the anchor boxes coincide better and faster; this is called CIoU:
CIoU_Loss = 1 - IoU + ρ²(b, b_gt) / c² + α·v,  with  v = (4/π²)·(arctan(w_gt/h_gt) - arctan(w/h))²  and  α = v / ((1 - IoU) + v),
where v measures the similarity of the aspect ratios and α is its weight.
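A direct numerical sketch of the IoU, DIoU and CIoU terms described above is given below for two boxes in (x1, y1, x2, y2) format; it is written from the standard CIoU definition and is not copied from the invention.

```python
# CIoU loss for a single box pair: 1 - IoU + center-distance term + aspect-ratio term.
import math

def ciou_loss(pred, gt, eps=1e-7):
    # intersection-over-union
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    iou = inter / (area_p + area_g - inter + eps)

    # DIoU term: squared center distance over squared enclosing-box diagonal
    cw = max(pred[2], gt[2]) - min(pred[0], gt[0])
    ch = max(pred[3], gt[3]) - min(pred[1], gt[1])
    c2 = cw ** 2 + ch ** 2 + eps
    rho2 = ((pred[0] + pred[2] - gt[0] - gt[2]) ** 2 +
            (pred[1] + pred[3] - gt[1] - gt[3]) ** 2) / 4

    # CIoU term: aspect-ratio consistency v with trade-off weight alpha
    v = (4 / math.pi ** 2) * (math.atan((gt[2] - gt[0]) / (gt[3] - gt[1] + eps)) -
                              math.atan((pred[2] - pred[0]) / (pred[3] - pred[1] + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v
```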
Accurate localization and recognition based on FPN + PAN. PAN + FPN is an improved pyramid structure based on the FPN and Mask R-CNN models: it improves the backbone structure, strengthens the feature pyramid and shortens the path for fusing high-level and low-level features. FPN is top-down: the feature maps used for prediction are obtained by fusing upper-level and lower-level features. A bottom-up feature pyramid is then added behind the FPN layer. Through this combination, the FPN layer passes strong semantic features from top to bottom, the added feature pyramid passes strong localization features from bottom to top, and parameters from different backbone layers are aggregated at the different detection layers. FPN alone only propagates high-level semantic information downwards and does not propagate localization information, so the bottom-up pyramid added behind it conveys the strong localization features. This improved pyramid structure markedly improves gesture recognition and localization performance.
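A compact sketch of the top-down (FPN) plus bottom-up (PAN) fusion follows; it assumes the three input feature maps already share a channel count and differ in resolution by factors of two, which is a simplification of the structure described above.

```python
# FPN top-down path (semantics flow down) followed by PAN bottom-up path (localization flows back up).
import torch.nn as nn
import torch.nn.functional as F

class FPNPAN(nn.Module):
    def __init__(self, ch=256):
        super().__init__()
        self.down = nn.ModuleList([nn.Conv2d(ch, ch, 3, stride=2, padding=1) for _ in range(2)])
        self.smooth = nn.ModuleList([nn.Conv2d(ch, ch, 3, padding=1) for _ in range(3)])

    def forward(self, c3, c4, c5):
        # FPN: fuse upsampled high-level features into lower levels
        p5 = c5
        p4 = c4 + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = c3 + F.interpolate(p4, scale_factor=2, mode="nearest")
        # PAN: push strong localization features from the bottom back to higher levels
        n3 = self.smooth[0](p3)
        n4 = self.smooth[1](p4 + self.down[0](n3))
        n5 = self.smooth[2](p5 + self.down[1](n4))
        return n3, n4, n5
```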
DropBlock: strengthening the features the network learns. Conventional networks use Dropout, which discards individual pixels of the feature map; but the network can then learn the same information from the units next to the dropped activated units, so it lacks a certain degree of generalization. DropBlock instead discards the whole pixel block of a region, so the network cannot learn the information of that region and is forced to attend to the features of other regions of the image in order to classify correctly, which yields better generalization and higher recognition accuracy.
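A minimal DropBlock sketch is given below; the drop probability and block size are illustrative assumptions.

```python
# DropBlock: zero whole block_size x block_size regions of the feature map (instead of isolated
# pixels as in Dropout), so the network must rely on features from other regions.
import torch
import torch.nn.functional as F

def drop_block(x, drop_prob=0.1, block_size=7):
    if drop_prob == 0.0:
        return x
    n, c, h, w = x.shape
    gamma = drop_prob / (block_size ** 2)               # seed probability per position
    seeds = (torch.rand(n, c, h, w, device=x.device) < gamma).float()
    mask = 1 - F.max_pool2d(seeds, kernel_size=block_size,
                            stride=1, padding=block_size // 2)  # grow each seed into a block
    return x * mask * mask.numel() / (mask.sum() + 1e-7)        # rescale the kept activations
```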
The embodiment can achieve the following technical effects:
the method has the advantages of automatically eliminating passerby, extracting the figure in the image by adopting a high-precision target detection algorithm for clipping, and simultaneously enabling the image to be close to reality as much as possible by adopting an advanced image restoration technology. The experimental result shows that in a well-prepared data set, the quantitative indexes IS and FID are higher than those of the prior technical scheme, the image distortion degree IS low, and the picture IS more attractive.
Embodiment three:
In this embodiment, building on the method for eliminating redundant figures from the image provided in embodiments one and two, an edge-connected adversarial generative network is used to automatically repair and complete the image after the person has been removed, performing fine-grained edge generation and interior inpainting. Edge generation and image completion are carried out by a two-stage network framework: the first stage repairs the edges and interior texture of the image, and the second stage completes color filling and image repair over the whole line framework.
In this embodiment, for the texture restoration stage, and considering that conventional edge-to-image methods cannot depict fine-grained image details, an image line restorer module is adopted. The module consists of an encoder with two downsampling steps, eight residual blocks and a deconvolution decoder. The residual blocks replace regular convolution with dilated convolution with a dilation factor of 2, so that granularity information at different scales can be captured.
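The following sketch mirrors the stated shape of the module (two downsampling steps, eight residual blocks with dilation 2, deconvolution decoder); the channel counts, normalization layers and single-channel line output are assumptions of the example, not the module of the invention.

```python
# Line/texture restorer sketch: 2x-downsampling encoder, eight dilated residual blocks, transposed-conv decoder.
import torch.nn as nn

class DilatedResBlock(nn.Module):
    def __init__(self, ch=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=2, dilation=2), nn.InstanceNorm2d(ch), nn.ReLU(True),
            nn.Conv2d(ch, ch, 3, padding=2, dilation=2), nn.InstanceNorm2d(ch),
        )

    def forward(self, x):
        return x + self.body(x)                                 # residual connection

class LineRestorer(nn.Module):
    def __init__(self, in_ch=3, ch=64):
        super().__init__()
        self.encoder = nn.Sequential(                           # two stride-2 downsamplings
            nn.Conv2d(in_ch, ch, 7, padding=3), nn.ReLU(True),
            nn.Conv2d(ch, ch * 2, 4, stride=2, padding=1), nn.ReLU(True),
            nn.Conv2d(ch * 2, ch * 4, 4, stride=2, padding=1), nn.ReLU(True),
        )
        self.middle = nn.Sequential(*[DilatedResBlock(ch * 4) for _ in range(8)])
        self.decoder = nn.Sequential(                           # transposed-conv upsampling
            nn.ConvTranspose2d(ch * 4, ch * 2, 4, stride=2, padding=1), nn.ReLU(True),
            nn.ConvTranspose2d(ch * 2, ch, 4, stride=2, padding=1), nn.ReLU(True),
            nn.Conv2d(ch, 1, 7, padding=3), nn.Sigmoid(),       # predicted edge/line map
        )

    def forward(self, x):
        return self.decoder(self.middle(self.encoder(x)))
```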
In this embodiment, the image restoration stage adopts a pixel-level generative network architecture. The network takes as input the RGB channels with the missing edge, a mask channel and a texture-line channel, treats the pixels of each channel as valid input, and provides a learned dynamic feature-selection mechanism for every position in every layer. The image is reconstructed from the incomplete-image edges of the previous stage together with the background image. The loss function is a joint loss of perceptual loss and style loss, with which the image restorer is built to achieve adversarial fine-grained restoration and completion of the incomplete image.
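One common way to realize a per-pixel dynamic feature-selection mechanism is a gated convolution, sketched below under that assumption; the five-channel input stacks the masked RGB image, the mask and the edge/line map as described above, and the layer sizes are illustrative rather than those of the invention.

```python
# Gated convolution: a sigmoid gate is computed for every spatial position and channel,
# letting the network weight valid versus missing pixels at each location.
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel=3, stride=1):
        super().__init__()
        pad = kernel // 2
        self.feature = nn.Conv2d(in_ch, out_ch, kernel, stride, pad)
        self.gate = nn.Conv2d(in_ch, out_ch, kernel, stride, pad)

    def forward(self, x):
        return torch.sigmoid(self.gate(x)) * torch.tanh(self.feature(x))

# Example input: 3 RGB channels with the hole zeroed out + 1 mask channel + 1 edge/line channel.
layer = GatedConv2d(in_ch=5, out_ch=64)
x = torch.randn(1, 5, 256, 256)
print(layer(x).shape)   # torch.Size([1, 64, 256, 256])
```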
In this embodiment, the perceptual loss function is defined as follows:
L_p = Σ_i (1 / N_i) · ‖ α_i(I_input) − α_i(I_prediction) ‖₁,
where I_input and I_prediction are the input and the output respectively, α_i is the activation map of the i-th layer of the pre-trained network, and N_i is the number of elements in that activation map.
The style loss function is defined as follows:
L_s = Σ_j ‖ G_j(I_prediction) − G_j(I_input) ‖₁,
where G_j is the C_j × C_j Gram matrix constructed from the activation map α_j.
The overall loss function is: L_Loss = λ_p · L_p + λ_s · L_s,
where L_p and L_s denote the perceptual loss function and the style loss function respectively, and λ_p and λ_s are balance parameters that trade off the magnitudes of the perceptual loss and the style loss.
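The joint loss can be sketched as follows; the choice of a pre-trained VGG16, the selected layers and the weights λ_p and λ_s are assumptions of the example, not values fixed by the invention.

```python
# Joint perceptual + style loss sketch: activation maps come from a frozen pre-trained VGG16,
# the style term compares Gram matrices, and the total is lambda_p * L_p + lambda_s * L_s.
import torch
import torch.nn.functional as F
import torchvision

vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").features[:23].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def activations(img, layer_ids=(3, 8, 15, 22)):         # relu1_2, relu2_2, relu3_3, relu4_3
    feats, x = [], img
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in layer_ids:
            feats.append(x)
    return feats

def gram(f):
    n, c, h, w = f.shape
    f = f.view(n, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)           # C x C Gram matrix per sample

def joint_loss(pred, target, lambda_p=0.1, lambda_s=250.0):
    l_p = l_s = 0.0
    for fp, ft in zip(activations(pred), activations(target)):
        l_p = l_p + F.l1_loss(fp, ft)                     # perceptual term
        l_s = l_s + F.l1_loss(gram(fp), gram(ft))         # style term on Gram matrices
    return lambda_p * l_p + lambda_s * l_s
```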
The embodiment can achieve the following technical effects:
In this embodiment, the pixel-level generative network architecture reconstructs the image from the incomplete-image edges of the previous stage and the background image, a joint loss of perceptual loss and style loss is designed to optimize the network architecture, and the image restorer built in this way achieves adversarial fine-grained restoration and completion of the incomplete image. This greatly improves the restoration precision and restoration effect of the redundant-figure elimination method, and the repaired image is much closer to the real image.
The foregoing shows and describes the general principles, essential features and advantages of the invention. Those skilled in the art will understand that the invention is not limited to the embodiments described above, which, together with the description, only illustrate the principle of the invention; various changes and improvements may be made without departing from the spirit and scope of the invention, and all such changes fall within the scope of the claimed invention. The protection scope of the invention is defined by the appended claims and their equivalents.

Claims (7)

1. A method for eliminating redundant figures from images based on machine learning, characterized by comprising the following steps:
step one: acquiring an original image containing a plurality of redundant persons, and detecting and extracting the redundant persons in the original image;
step two: labeling the persons detected and extracted from the image, cropping out the labeled redundant persons, and retaining the target-person image;
step three: using an image cropping-and-inpainting model to repair and complete the cropped image, obtaining an image with the redundant persons removed.
2. The method for eliminating redundant figures from images based on machine learning according to claim 1, characterized in that step one specifically comprises: when the acquired original images contain redundant persons and small targets, the original images are fed into a target detection and extraction model, object detection is performed on the redundant persons and small targets in the original images, and the detected redundant persons and small targets are segmented.
3. The method for eliminating redundant figures from images based on machine learning according to claim 2, characterized in that the process of detecting the redundant persons and small targets in the original image specifically comprises the following substeps:
S11, Mosaic data enhancement: randomly reading four pictures from the original images each time, flipping, scaling and color-gamut-shifting each of them, stitching the four processed pictures together, and finally merging the pictures and their bounding boxes;
S12, adaptive anchor-box calculation: on the basis of predefined boxes, building training samples from the offset of the ground-truth box position relative to the preset box during training of the target detection and extraction model; in the gesture recognition network computation, outputting predicted boxes on top of the initial anchor boxes, comparing them with the ground-truth boxes, computing the gap between the two, and updating the network parameters iteratively by backpropagation;
S13, adaptive image scaling for multi-scale training: computing the scale ratio, the scaled standard size and the border-padding value; uniformly scaling the original picture to the preset standard size according to the scale ratio, applying adaptive gray-border padding according to the padding value, and feeding the padded picture into the FPN + PAN pyramid structure for accurate person recognition.
4. The method for eliminating redundant figures from images based on machine learning according to claim 1, characterized in that step two specifically comprises: labeling the persons and gesture small targets detected and extracted in the image separately, and cropping the labeled redundant persons and gestures using DropBlock to obtain the cropped target-person image.
5. The method for eliminating redundant figures from images based on machine learning according to claim 1, characterized in that step three specifically comprises: using an edge-connected adversarial generative network to perform fine-grained edge generation and interior inpainting of the image; edge generation and image completion are carried out by a two-stage network framework, where the first stage repairs the edges and interior texture of the image and the second stage completes color filling and image repair over the whole line framework.
6. The method for eliminating redundant figures from images based on machine learning according to claim 5, characterized in that the first-stage process of repairing the edges and interior texture of the image comprises: building an image line restorer module consisting of an encoder with two downsampling steps, eight residual blocks and a deconvolution decoder, the residual blocks replacing regular convolution with dilated convolution with a dilation factor of 2; and using the image line restorer module to perform texture restoration on the cropped target-person image.
7. The method for eliminating redundant figures from images based on machine learning according to claim 5, characterized in that the second-stage process of completing color filling and image repair of the whole line framework specifically comprises: building a pixel-level generative network architecture whose inputs are the RGB channels with the missing edge, a mask channel and a texture-line channel, treating the pixels of each channel as valid input and providing a dynamic feature-selection mechanism for every position in every layer; reconstructing the image from the edges of the incomplete image repaired in the first stage together with the background image; and building an image restorer with a joint loss of perceptual loss and style loss to achieve adversarial fine-grained restoration and completion of the incomplete image.
CN202110880980.9A 2021-08-02 2021-08-02 Method for eliminating redundant images based on machine learning Pending CN113724153A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110880980.9A CN113724153A (en) 2021-08-02 2021-08-02 Method for eliminating redundant images based on machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110880980.9A CN113724153A (en) 2021-08-02 2021-08-02 Method for eliminating redundant images based on machine learning

Publications (1)

Publication Number Publication Date
CN113724153A true CN113724153A (en) 2021-11-30

Family

ID=78674580

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110880980.9A Pending CN113724153A (en) 2021-08-02 2021-08-02 Method for eliminating redundant images based on machine learning

Country Status (1)

Country Link
CN (1) CN113724153A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117333400A (en) * 2023-11-06 2024-01-02 华中农业大学 Root box cultivated crop root system image broken root restoration and phenotype extraction method

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106603903A (en) * 2015-10-15 2017-04-26 中兴通讯股份有限公司 Photo processing method and apparatus
US20180293734A1 (en) * 2017-04-06 2018-10-11 General Electric Company Visual anomaly detection system
US20200065976A1 (en) * 2018-08-23 2020-02-27 Seoul National University R&Db Foundation Method and system for real-time target tracking based on deep learning
CN111968053A (en) * 2020-08-13 2020-11-20 南京邮电大学 Image restoration method based on gate-controlled convolution generation countermeasure network
CN112364837A (en) * 2020-12-09 2021-02-12 四川长虹电器股份有限公司 Bill information identification method based on target detection and text identification
CN112380952A (en) * 2020-11-10 2021-02-19 广西大学 Power equipment infrared image real-time detection and identification method based on artificial intelligence
CN112668432A (en) * 2020-12-22 2021-04-16 上海幻维数码创意科技股份有限公司 Human body detection tracking method in ground interactive projection system based on YoloV5 and Deepsort

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106603903A (en) * 2015-10-15 2017-04-26 中兴通讯股份有限公司 Photo processing method and apparatus
US20180293734A1 (en) * 2017-04-06 2018-10-11 General Electric Company Visual anomaly detection system
US20200065976A1 (en) * 2018-08-23 2020-02-27 Seoul National University R&Db Foundation Method and system for real-time target tracking based on deep learning
CN111968053A (en) * 2020-08-13 2020-11-20 南京邮电大学 Image restoration method based on gate-controlled convolution generation countermeasure network
CN112380952A (en) * 2020-11-10 2021-02-19 广西大学 Power equipment infrared image real-time detection and identification method based on artificial intelligence
CN112364837A (en) * 2020-12-09 2021-02-12 四川长虹电器股份有限公司 Bill information identification method based on target detection and text identification
CN112668432A (en) * 2020-12-22 2021-04-16 上海幻维数码创意科技股份有限公司 Human body detection tracking method in ground interactive projection system based on YoloV5 and Deepsort

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KAMYAR NAZERI et al.: "EdgeConnect: Generative Image Inpainting with Adversarial Edge Learning", arXiv:1901.00212v3 *
孙婷珠 (Sun Tingzhu): "Small-region image inpainting based on multi-scale neural networks" (基于多尺度神经网络的小区域图像修复), China Master's Theses Full-text Database, Information Science and Technology series *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117333400A (en) * 2023-11-06 2024-01-02 华中农业大学 Root box cultivated crop root system image broken root restoration and phenotype extraction method
CN117333400B (en) * 2023-11-06 2024-04-30 华中农业大学 Root box cultivated crop root system image broken root restoration and phenotype extraction method

Similar Documents

Publication Publication Date Title
CN110176027B (en) Video target tracking method, device, equipment and storage medium
CN109299274B (en) Natural scene text detection method based on full convolution neural network
Mahmoud A new fast skin color detection technique
WO2021208600A1 (en) Image processing method, smart device, and computer-readable storage medium
Johnston et al. A review of digital video tampering: From simple editing to full synthesis
CN110163188B (en) Video processing and method, device and equipment for embedding target object in video
CN112950477B (en) Dual-path processing-based high-resolution salient target detection method
CN111353544B (en) Improved Mixed Pooling-YOLOV 3-based target detection method
CN112184585B (en) Image completion method and system based on semantic edge fusion
CN112084859B (en) Building segmentation method based on dense boundary blocks and attention mechanism
CN110781980B (en) Training method of target detection model, target detection method and device
CN113610087B (en) Priori super-resolution-based image small target detection method and storage medium
CN113052170B (en) Small target license plate recognition method under unconstrained scene
CN114418840A (en) Image splicing positioning detection method based on attention mechanism
KR102628115B1 (en) Image processing method, device, storage medium, and electronic device
CN116012232A (en) Image processing method and device, storage medium and electronic equipment
CN110942456B (en) Tamper image detection method, device, equipment and storage medium
Zheng et al. T-net: Deep stacked scale-iteration network for image dehazing
CN116645592A (en) Crack detection method based on image processing and storage medium
CN115731597A (en) Automatic segmentation and restoration management platform and method for mask image of face mask
CN112884657B (en) Face super-resolution reconstruction method and system
CN114565508A (en) Virtual reloading method and device
CN113724153A (en) Method for eliminating redundant images based on machine learning
Singh et al. Dense spatially-weighted attentive residual-haze network for image dehazing
de Torres et al. An efficient approach to automatic generation of time-lapse video sequences

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination