CN115297263B - Automatic photographing control method and system suitable for cube shooting and cube shooting - Google Patents
Automatic photographing control method and system suitable for cube shooting and cube shooting Download PDFInfo
- Publication number
- CN115297263B CN115297263B CN202211020415.6A CN202211020415A CN115297263B CN 115297263 B CN115297263 B CN 115297263B CN 202211020415 A CN202211020415 A CN 202211020415A CN 115297263 B CN115297263 B CN 115297263B
- Authority
- CN
- China
- Prior art keywords
- gesture
- image
- prediction
- recognized
- representing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000000034 method Methods 0.000 title claims abstract description 78
- 238000001514 detection method Methods 0.000 claims abstract description 47
- 238000001914 filtration Methods 0.000 claims abstract description 19
- 238000012545 processing Methods 0.000 claims abstract description 16
- 239000011159 matrix material Substances 0.000 claims description 48
- 230000006870 function Effects 0.000 claims description 28
- 238000012549 training Methods 0.000 claims description 26
- 238000010586 diagram Methods 0.000 claims description 19
- 238000011478 gradient descent method Methods 0.000 claims description 13
- 230000004927 fusion Effects 0.000 claims description 8
- 238000000605 extraction Methods 0.000 claims description 7
- 230000000052 comparative effect Effects 0.000 claims description 6
- 238000005259 measurement Methods 0.000 claims description 6
- 238000005457 optimization Methods 0.000 claims description 6
- 238000011176 pooling Methods 0.000 claims description 6
- 238000005070 sampling Methods 0.000 claims description 6
- 238000001228 spectrum Methods 0.000 claims description 6
- 230000007704 transition Effects 0.000 claims description 6
- 230000008447 perception Effects 0.000 claims description 5
- 238000004891 communication Methods 0.000 claims description 4
- 230000005540 biological transmission Effects 0.000 claims 1
- 230000001815 facial effect Effects 0.000 description 2
- 210000003813 thumb Anatomy 0.000 description 2
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G06T5/77—
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/107—Static hand or arm
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/28—Recognition of hand or arm movements, e.g. recognition of deaf sign language
-
- G—PHYSICS
- G07—CHECKING-DEVICES
- G07F—COIN-FREED OR LIKE APPARATUS
- G07F17/00—Coin-freed apparatus for hiring articles; Coin-freed facilities or services
- G07F17/26—Coin-freed apparatus for hiring articles; Coin-freed facilities or services for printing, stamping, franking, typing or teleprinting apparatus
- G07F17/266—Coin-freed apparatus for hiring articles; Coin-freed facilities or services for printing, stamping, franking, typing or teleprinting apparatus for the use of a photocopier or printing device
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30196—Human being; Person
- G06T2207/30201—Face
Abstract
The invention discloses an automatic photographing control method suitable for a photographing cube, which comprises the following steps. S1: perform gesture detection on the first N frames of images of the video stream acquired by the camera using a YOLO-based gesture detection model to obtain a hand region, where N ≥ 3. S2: model the target motion state of the hand region using Kalman filtering, tracking and predicting the position of the hand region in the next frame in real time to obtain the gesture of the object to be recognized. S3: compare the gesture of the object to be recognized with the reference gestures in a preset gesture template library to calculate a similarity; when the calculated similarity is larger than a set threshold, judge the gesture of the object to be recognized to be the photographing trigger gesture and start the photographing countdown; otherwise, return to step S1. S4: when the photographing countdown ends, control the camera to take a photograph, obtaining an original image. S5: perform face-beautification processing on the original image using the constructed generative adversarial network to obtain and output the processed target image.
Description
Technical Field
The invention relates to the technical field of self-service photographing, in particular to an automatic photographing control method and system suitable for a cube and the cube.
Background
The shooting cube, also called an intelligent photo box or intelligent photo booth, is a self-service photographing device placed in busy public places such as streets and subway stations, allowing users to conveniently take high-definition photos, ID photos, and the like. It is very popular and is found in all kinds of public venues such as shopping malls, colleges and universities, tourist attractions, railway stations, and airports.
In the process of implementing the invention, the inventor found that the conventional shooting cube (self-service photographing device) leaves room for further improvement in both its photographing control mode and the quality of the captured images. The invention therefore aims to simplify the self-service photographing control of the shooting cube and to improve the quality of the captured images.
Disclosure of Invention
The invention aims to provide an automatic photographing control method and system suitable for a photographing cube, and the photographing cube itself, which can effectively solve the above technical problems in the prior art.
In order to achieve the above object, an embodiment of the present invention provides an automatic photographing control method suitable for a photographing cube, wherein the photographing cube includes a control processor and a camera in communication with the control processor, and the automatic photographing control method includes:
S1, performing gesture detection on the first N frames of images of a video stream acquired by the camera based on a YOLO gesture detection model to obtain a hand area, where N ≥ 3;
s2, modeling the target motion state of the hand area by using Kalman filtering, and tracking and predicting the position of the hand area in the next frame in real time to obtain a gesture of an object to be recognized;
s3, comparing the gesture of the object to be recognized with a reference gesture in a preset gesture template library to calculate similarity; when the calculated similarity is larger than a set threshold value, judging the gesture of the object to be recognized as a photographing triggering gesture, and starting photographing countdown; otherwise, returning to the step S1;
s4, when the photographing countdown is finished, controlling the camera to photograph to obtain an original image;
and S5, performing face-beautification processing on the original image using the constructed generative adversarial network to obtain a processed target image and outputting it.
Preferably, in step S1, the gesture detection model includes an input unit, an encoder and a decoder, the input unit is configured to input the first N frames of images of the video stream acquired by the camera, and the encoder is configured to perform feature extraction and feature fusion on the first N frames of images to obtain a feature map with richer information; the decoder decodes the feature map into a detection result, thereby obtaining a hand region;
optimizing a loss function L of the following formula (1-1) based on a preset gesture detection training data set, and continuously performing iterative training by a gradient descent method to obtain an optimal detection result:

$$
\begin{aligned}
L ={}& \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \\
&+\lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2+\left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right] \\
&+\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left(C_i-\hat{C}_i\right)^2
+\lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}\left(C_i-\hat{C}_i\right)^2 \\
&+\sum_{i=0}^{S^2}\mathbb{1}_{i}^{obj}\sum_{c\in classes}\left(p_i(c)-\hat{p}_i(c)\right)^2
\end{aligned}
\tag{1-1}
$$

where $S^2$ represents the size of the output-layer feature map, i.e. the number of grid cells into which the input image is divided; $B$ is the number of Anchors per grid cell; $\mathbb{1}_{ij}^{obj}$ indicates that the $j$-th Anchor in the $i$-th cell is responsible for predicting this object (the Anchor with the largest IoU with the ground-truth box is selected as the prediction box), and $\mathbb{1}_{ij}^{noobj}$ indicates that it is not responsible; $x_i, y_i, w_i, h_i$ and $\hat{x}_i, \hat{y}_i, \hat{w}_i, \hat{h}_i$ are the center coordinates and the width and height of each ground-truth label and of the prediction box respectively, all normalized to between 0 and 1, with $x_i, y_i$ and $\hat{x}_i, \hat{y}_i$ given as offsets relative to their cell; $C_i$ and $\hat{C}_i$ are the ground-truth box confidence and predicted box confidence respectively; $p_i(c)$ and $\hat{p}_i(c)$ are the true and predicted class probabilities of the corresponding cell; $classes$ is the number of object categories; $\lambda_{coord}$ and $\lambda_{noobj}$ are the loss weights for bounding-box coordinate prediction and for the confidence prediction of boxes containing no object, respectively.
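As an illustrative sketch (not the patent's code), the structure of this sum-squared YOLO-style loss can be written out in NumPy for the simplified single-Anchor case (B = 1); the array layout and the responsibility mask are assumptions:

```python
import numpy as np

def yolo_loss(pred, target, obj_mask, lambda_coord=5.0, lambda_noobj=0.5):
    """Simplified sum-squared YOLO-style loss over a flattened S*S grid.

    pred, target: (S*S, 5 + classes) arrays laid out as [x, y, w, h, conf, p(c)...]
    obj_mask:     (S*S,) boolean, True where a cell is responsible for an object.
    Assumes one Anchor per cell (B = 1); lambda defaults follow the YOLO paper.
    """
    noobj = ~obj_mask
    # coordinate terms, weighted by lambda_coord (square roots on width/height)
    xy = np.sum((pred[obj_mask, :2] - target[obj_mask, :2]) ** 2)
    wh = np.sum((np.sqrt(pred[obj_mask, 2:4]) - np.sqrt(target[obj_mask, 2:4])) ** 2)
    # confidence terms: full weight on object cells, lambda_noobj elsewhere
    conf_obj = np.sum((pred[obj_mask, 4] - target[obj_mask, 4]) ** 2)
    conf_noobj = np.sum((pred[noobj, 4] - target[noobj, 4]) ** 2)
    # classification term on object cells only
    cls = np.sum((pred[obj_mask, 5:] - target[obj_mask, 5:]) ** 2)
    return lambda_coord * (xy + wh) + conf_obj + lambda_noobj * conf_noobj + cls
```

A perfect prediction yields zero loss; a confidence error on an object cell contributes with weight 1, while the same error on a background cell is down-weighted by lambda_noobj.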
Preferably, in step S2, the Kalman filtering comprises a prediction process and an update process: the prediction estimates the current state from the state at the previous time step, and the update corrects the prediction using the observation, yielding the optimal state estimate. Specifically, it comprises the following formulas (1-2) to (1-6).

Prediction process:

$$\hat{x}_t^- = A\hat{x}_{t-1} + Bu_t \tag{1-2}$$
$$\Sigma_t^- = A\Sigma_{t-1}A^{\mathsf{T}} + Q \tag{1-3}$$

Update process:

$$K_t = \Sigma_t^- H^{\mathsf{T}}\left(H\Sigma_t^- H^{\mathsf{T}} + R\right)^{-1} \tag{1-4}$$
$$\hat{x}_t = \hat{x}_t^- + K_t\left(z_t - H\hat{x}_t^-\right) \tag{1-5}$$
$$\Sigma_t = (I - K_t H)\Sigma_t^- \tag{1-6}$$

where $\hat{x}_t^-$ is the prediction estimate derived from the previous state; $\hat{x}_t$ is the updated optimal estimate of $\hat{x}_t^-$; $A$ is the state transition matrix; $H$ is the observation matrix; $B$ is the input control matrix; $u_t$ is the external control quantity of the system; $z_t$ is the observation at time $t$; $Q$ and $R$ are the dynamic (process) noise covariance matrix and the measurement noise covariance matrix respectively; $K_t$ is the Kalman gain at time $t$; $\Sigma_t^-$ is the prediction error covariance matrix; and $\Sigma_t$ is the filtering error covariance matrix.
According to the above process, the tracker's state is initialized with the hand-region bounding box obtained by detecting the first N frames of images of the video stream acquired by the camera, and the estimated target's position in each subsequent frame is then continuously predicted, realizing real-time tracking.
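Equations (1-2) to (1-6) map directly onto a few lines of NumPy. The sketch below is generic Kalman filtering in the patent's notation (the covariance Σ is written P here), not the tracker's actual implementation:

```python
import numpy as np

def kalman_predict(x, P, A, B, u, Q):
    """Prediction process (eqs. 1-2, 1-3): propagate state and covariance."""
    x_pred = A @ x + B @ u            # eq. (1-2)
    P_pred = A @ P @ A.T + Q          # eq. (1-3)
    return x_pred, P_pred

def kalman_update(x_pred, P_pred, z, H, R):
    """Update process (eqs. 1-4 to 1-6): correct with measurement z."""
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)          # Kalman gain, eq. (1-4)
    x = x_pred + K @ (z - H @ x_pred)            # eq. (1-5)
    P = (np.eye(len(x_pred)) - K @ H) @ P_pred   # eq. (1-6)
    return x, P
```

With equal prediction and measurement uncertainty, the gain is 0.5 and the updated estimate lands halfway between the prediction and the observation, which is the expected blending behaviour.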
Preferably, in step S3, the picture to be recognized containing the gesture of the object to be recognized and the reference gesture pictures in the gesture template library are first scaled to the same size; the Euclidean distance is then computed according to the following formula (1-7), and the reference gesture picture with the minimum Euclidean distance is taken as the reference gesture picture with the highest similarity:

$$t^* = \arg\min_{t \in T}\left\|x - t\right\|_2 \tag{1-7}$$

where $x$ is the picture to be recognized, $T$ is the set of reference gesture pictures in the gesture template library, $t$ is an element of the set $T$, and $t^*$ is the element with the minimum Euclidean distance.

If the Euclidean distance of the found $t^*$ is less than a preset threshold, the similarity between $t^*$ and the picture to be recognized is deemed greater than the set threshold, and the gesture of the object to be recognized is judged to be the photographing trigger gesture.
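A minimal sketch of the template-matching decision in step S3, assuming the images are already rescaled to a common size and represented as NumPy arrays (the function names are illustrative, not the patent's):

```python
import numpy as np

def best_template(x, templates):
    """Eq. (1-7): pick the reference gesture image with the smallest L2 distance."""
    dists = [np.linalg.norm(x.astype(float) - t.astype(float)) for t in templates]
    i = int(np.argmin(dists))
    return i, dists[i]

def is_trigger(x, templates, dist_threshold):
    """Distance below threshold corresponds to similarity above threshold (step S3)."""
    _, d = best_template(x, templates)
    return d < dist_threshold
```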
Preferably, in step S5, the generative adversarial network comprises a generator and a discriminator, the generator being implemented as shown in the following formula (2-1),

G(X)=F(X)+X (2-1)

where X is the input original image and F(X) is the output obtained after the original image X is processed by the feature learning module F. The feature learning module F extracts features mainly through a sparsely connected parallel structure: the input original image X is first reduced to 1/4 of its original size by convolution and pooling; the resolution is then kept unchanged on the main path, while the branch runs at 1/2 the resolution of the main branch to learn high-level semantic features; finally, the high-level semantic features are fused with the structurally rich main branch and output.

The discriminator D takes X and G(X) as input; after feature extraction through multi-layer convolution, a feature map with a large receptive field and small resolution is obtained, and the mean of this feature map is used as the discriminator's score.
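The residual form of equation (2-1) and the discriminator's mean-of-feature-map score reduce to very little code; in the sketch below, `residual_fn` stands in for the feature learning module F, which is assumed rather than implemented:

```python
import numpy as np

def generator_output(x, residual_fn):
    """Eq. (2-1): the network learns only the retouching residual F(X);
    the global skip connection preserves the input's overall structure."""
    return residual_fn(x) + x

def discriminator_score(feature_map):
    """Patch-style score: the mean of the final low-resolution feature map."""
    return float(np.mean(feature_map))
```

When the learned residual is zero, G(X) reduces to the identity, which is why this parameterization makes it easy for the generator to leave most of the image untouched.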
Preferably, D adopts binary cross-entropy as its loss function throughout model training. To better generate the required target image, the generator's loss objective adds three regularization terms on top of the GAN loss: an L1 loss measuring numerical distance, a perceptual loss measuring multi-scale structural similarity, and a focal frequency loss measuring frequency-domain similarity. Specifically:
the target loss function of the conditional GAN can be expressed as shown in equation (2-2),
$$L_{cGAN}(G,D) = \mathbb{E}_{x,y}\left[\log D(x,y)\right] + \mathbb{E}_{x}\left[\log\left(1 - D(x, G(x))\right)\right] \tag{2-2}$$

where G tries to minimize this objective while D tries to maximize it, i.e. $\arg\min_G\max_D L_{cGAN}(G,D)$; $x$ is the input original image and $y$ is the target image;
in order to ensure the consistency of input and output, an L1 loss is introduced as a constraint, as in formula (2-3):

$$L_{L1}(G) = \mathbb{E}_{x,y}\left[\left\|y - G(x)\right\|_1\right] \tag{2-3}$$
in order to prevent the generated image from losing structural information, a measure of structural similarity is introduced, as in formula (2-4):

$$L_{MS\text{-}SSIM} = 1 - \mathrm{MS\_SSIM}(G(x), y) \tag{2-4}$$

where

$$\mathrm{MS\_SSIM}(G(x),y) = \left[l_M(G(x),y)\right]^{\alpha_M}\prod_{j=1}^{M}\left[c_j(G(x),y)\right]^{\beta_j}\left[s_j(G(x),y)\right]^{\gamma_j} \tag{2-5}$$

The MS_SSIM method takes the reference image and the original image signal as input, and repeatedly applies a low-pass filter followed by downsampling by a factor of 2. Taking the scale of the original image as 1, the highest scale M is obtained after M-1 iterations. At the j-th scale, the contrast comparison and structural similarity, denoted $c_j(G(x),y)$ and $s_j(G(x),y)$ respectively, are computed by formulas (2-6) and (2-7):

$$c_j(G(x),y) = \frac{2\sigma_G\sigma_y + C_2}{\sigma_G^2 + \sigma_y^2 + C_2} \tag{2-6}$$

$$s_j(G(x),y) = \frac{\sigma_{Gy} + C_3}{\sigma_G\sigma_y + C_3} \tag{2-7}$$

The luminance comparison is computed only at scale M, by formula (2-8):

$$l_M(G(x),y) = \frac{2\mu_G\mu_y + C_1}{\mu_G^2 + \mu_y^2 + C_1} \tag{2-8}$$

where $\mu_G$ and $\mu_y$ are the means of the generated image and the label image respectively, $\sigma_G$ and $\sigma_y$ their standard deviations, and $\sigma_{Gy}$ their covariance; $\alpha_M$, $\beta_j$, $\gamma_j$ weight the importance of the luminance estimate, contrast, and structural-similarity components at the corresponding scale (to simplify computation, set $\alpha_j = \beta_j = \gamma_j$); $C_1$, $C_2$, $C_3$ are constants introduced to avoid instability, with $C_1 = (K_1 L)^2$, $C_2 = (K_2 L)^2$, $C_3 = C_2/2$, taking parameters $K_1 = 0.01$, $K_2 = 0.03$, and $L$ the dynamic range of pixel values;
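The luminance, contrast, and structure terms of formulas (2-6) to (2-8) can be sketched with global image statistics (real MS-SSIM uses local Gaussian windows and multiple scales, so this is a simplification, not the patent's implementation):

```python
import numpy as np

def ssim_components(a, b, K1=0.01, K2=0.03, L=255.0):
    """Global-statistics versions of the luminance (2-8), contrast (2-6)
    and structure (2-7) comparisons for two same-size images a and b."""
    C1, C2 = (K1 * L) ** 2, (K2 * L) ** 2
    C3 = C2 / 2
    mu_a, mu_b = a.mean(), b.mean()
    sa, sb = a.std(), b.std()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    l = (2 * mu_a * mu_b + C1) / (mu_a ** 2 + mu_b ** 2 + C1)   # luminance
    c = (2 * sa * sb + C2) / (sa ** 2 + sb ** 2 + C2)           # contrast
    s = (cov + C3) / (sa * sb + C3)                             # structure
    return l, c, s
```

For two identical images, all three components equal 1, so the MS-SSIM product is 1 and the loss of formula (2-4) vanishes.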
to further improve the image quality, the gap between frequency domains is introduced as a constraint term, as in formula (2-9):

$$L_{FFL} = \frac{1}{MN}\sum_{u=0}^{M-1}\sum_{v=0}^{N-1} w(u,v)\left|y(u,v) - G(u,v)\right|^2 \tag{2-9}$$

where $M \times N$ is the size of the image, $w(u,v)$ is the spectrum weight matrix, $(u,v)$ are frequency-domain coordinates, and $y(u,v)$ and $G(u,v)$ are the corresponding complex frequency values;
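A sketch of the focal frequency loss of formula (2-9) using a 2-D FFT; the choice of spectrum weight w(u,v) = |Y − G|^α is an assumption borrowed from the focal frequency loss literature, since the patent leaves w(u,v) unspecified here:

```python
import numpy as np

def focal_frequency_loss(gen, target, alpha=1.0):
    """Eq. (2-9) sketch: squared spectrum distance |y(u,v) - G(u,v)|^2,
    weighted by an assumed focal weight w = |y - G|^alpha, averaged over M x N."""
    Y = np.fft.fft2(target)   # complex frequency values y(u, v)
    Gf = np.fft.fft2(gen)     # complex frequency values G(u, v)
    diff = np.abs(Y - Gf)
    w = diff ** alpha         # assumed focal weighting
    return float(np.mean(w * diff ** 2))
```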
the overall optimization objective of the generator is therefore the following equation (2-10):
$$L_{total} = L_{cGAN} + \alpha L_{L1} + \beta L_{MS\text{-}SSIM} + \gamma L_{FFL} \tag{2-10}$$

where α, β and γ are the weights of the respective loss terms, set according to the needs of actual training; the larger the weight of a constraint term, the greater its influence on the total objective loss. The objective function is then continuously and iteratively optimized by gradient descent to obtain the optimal solution.
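Combining the four terms per formula (2-10) is a plain weighted sum; the default weights below are illustrative only, since the patent says to set them per training run:

```python
def total_generator_loss(l_cgan, l_l1, l_ms_ssim, l_ffl,
                         alpha=100.0, beta=1.0, gamma=1.0):
    """Eq. (2-10): weighted sum of the four generator objectives.
    alpha, beta, gamma are tuning knobs; these defaults are assumptions."""
    return l_cgan + alpha * l_l1 + beta * l_ms_ssim + gamma * l_ffl
```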
The embodiment of the invention correspondingly provides an automatic photographing control system suitable for a photographing cube, wherein the photographing cube comprises a control processor and a camera in communication with the control processor, and the automatic photographing control system comprises:
the gesture detection module, used for performing gesture detection on the first N frames of images of the video stream acquired by the camera based on a YOLO gesture detection model to obtain a hand area, where N ≥ 3;
the hand tracking module is used for modeling the target motion state of the hand area by using Kalman filtering, tracking and predicting the position of the hand area in the next frame in real time and obtaining the gesture of the object to be recognized;
the gesture recognition module, used for comparing the gesture of the object to be recognized with the reference gestures in a preset gesture template library to calculate a similarity; when the calculated similarity is larger than a set threshold, the gesture of the object to be recognized is judged to be the photographing trigger gesture and the photographing countdown is started; otherwise, control returns to the gesture detection module;
the shooting control module is used for controlling the camera to shoot when the shooting countdown is finished so as to obtain an original image;
and the image beautification processing module, used for performing face-beautification processing on the original image using the constructed generative adversarial network to obtain a processed target image and outputting it.
Preferably, in the gesture detection module, the gesture detection model includes an input unit, an encoder and a decoder, the input unit is configured to input the first N frames of images of the video stream acquired by the camera, and the encoder is configured to perform feature extraction and feature fusion on the first N frames of images to obtain a feature map with richer information; the decoder decodes the feature map into a detection result, thereby obtaining a hand region;
optimizing a loss function L of the following formula (1-1) based on a preset gesture detection training data set, and continuously performing iterative training by a gradient descent method to obtain an optimal detection result:

$$
\begin{aligned}
L ={}& \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \\
&+\lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2+\left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right] \\
&+\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left(C_i-\hat{C}_i\right)^2
+\lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}\left(C_i-\hat{C}_i\right)^2 \\
&+\sum_{i=0}^{S^2}\mathbb{1}_{i}^{obj}\sum_{c\in classes}\left(p_i(c)-\hat{p}_i(c)\right)^2
\end{aligned}
\tag{1-1}
$$

where $S^2$ represents the size of the output-layer feature map, i.e. the number of grid cells into which the input image is divided; $B$ is the number of Anchors per grid cell; $\mathbb{1}_{ij}^{obj}$ indicates that the $j$-th Anchor in the $i$-th cell is responsible for predicting this object (the Anchor with the largest IoU with the ground-truth box is selected as the prediction box), and $\mathbb{1}_{ij}^{noobj}$ indicates that it is not responsible; $x_i, y_i, w_i, h_i$ and $\hat{x}_i, \hat{y}_i, \hat{w}_i, \hat{h}_i$ are the center coordinates and the width and height of each ground-truth label and of the prediction box respectively, all normalized to between 0 and 1, with $x_i, y_i$ and $\hat{x}_i, \hat{y}_i$ given as offsets relative to their cell; $C_i$ and $\hat{C}_i$ are the ground-truth box confidence and predicted box confidence respectively; $p_i(c)$ and $\hat{p}_i(c)$ are the true and predicted class probabilities of the corresponding cell; $classes$ is the number of object categories; $\lambda_{coord}$ and $\lambda_{noobj}$ are the loss weights for bounding-box coordinate prediction and for the confidence prediction of boxes containing no object, respectively.
In the hand tracking module, the Kalman filtering comprises a prediction process and an update process: the prediction estimates the current state from the state at the previous time step, and the update corrects the prediction using the observation, yielding the optimal state estimate. Specifically, it comprises the following formulas (1-2) to (1-6).

Prediction process:

$$\hat{x}_t^- = A\hat{x}_{t-1} + Bu_t \tag{1-2}$$
$$\Sigma_t^- = A\Sigma_{t-1}A^{\mathsf{T}} + Q \tag{1-3}$$

Update process:

$$K_t = \Sigma_t^- H^{\mathsf{T}}\left(H\Sigma_t^- H^{\mathsf{T}} + R\right)^{-1} \tag{1-4}$$
$$\hat{x}_t = \hat{x}_t^- + K_t\left(z_t - H\hat{x}_t^-\right) \tag{1-5}$$
$$\Sigma_t = (I - K_t H)\Sigma_t^- \tag{1-6}$$

where $\hat{x}_t^-$ is the prediction estimate derived from the previous state; $\hat{x}_t$ is the updated optimal estimate of $\hat{x}_t^-$; $A$ is the state transition matrix; $H$ is the observation matrix; $B$ is the input control matrix; $u_t$ is the external control quantity of the system; $z_t$ is the observation at time $t$; $Q$ and $R$ are the dynamic (process) noise covariance matrix and the measurement noise covariance matrix respectively; $K_t$ is the Kalman gain at time $t$; $\Sigma_t^-$ is the prediction error covariance matrix; and $\Sigma_t$ is the filtering error covariance matrix.
according to the process, initializing the state of the tracker by using a hand region bounding box obtained by detecting the first N frames of images of the video stream acquired by the camera, and then continuously predicting the position of the estimated target in the next frame to realize real-time tracking;
In the gesture recognition module, the picture to be recognized containing the gesture of the object to be recognized and the reference gesture pictures in the gesture template library are first scaled to the same size; the Euclidean distance is then computed according to the following formula (1-7), and the reference gesture picture with the minimum Euclidean distance is taken as the reference gesture picture with the highest similarity:

$$t^* = \arg\min_{t \in T}\left\|x - t\right\|_2 \tag{1-7}$$

where $x$ is the picture to be recognized, $T$ is the set of reference gesture pictures in the gesture template library, $t$ is an element of the set $T$, and $t^*$ is the element with the minimum Euclidean distance.

If the Euclidean distance of the found $t^*$ is less than a preset threshold, the similarity between $t^*$ and the picture to be recognized is deemed greater than the set threshold, and the gesture of the object to be recognized is judged to be the photographing trigger gesture.
Preferably, in the image beautification processing module, the generative adversarial network comprises a generator and a discriminator, the generator being implemented as shown in the following formula (2-1),

G(X)=F(X)+X (2-1)

where X is the input original image and F(X) is the output obtained after the original image X is processed by the feature learning module F. The feature learning module F extracts features mainly through a sparsely connected parallel structure: the input original image X is first reduced to 1/4 of its original size by convolution and pooling; the resolution is then kept unchanged on the main path, while the branch runs at 1/2 the resolution of the main branch to learn high-level semantic features; finally, the high-level semantic features are fused with the structurally rich main branch and output.

The discriminator D takes X and G(X) as input; after feature extraction through multi-layer convolution, a feature map with a large receptive field and small resolution is obtained, and the mean of this feature map is used as the discriminator's score;
in the whole model training process, D adopts binary cross-entropy as its loss function. To better generate the required target image, the generator's loss objective adds three regularization terms on top of the GAN loss: an L1 loss measuring numerical distance, a perceptual loss measuring multi-scale structural similarity, and a focal frequency loss measuring frequency-domain similarity. Specifically:
the target loss function of the conditional GAN can be expressed as shown in equation (2-2),
$$L_{cGAN}(G,D) = \mathbb{E}_{x,y}\left[\log D(x,y)\right] + \mathbb{E}_{x}\left[\log\left(1 - D(x, G(x))\right)\right] \tag{2-2}$$

where G tries to minimize this objective while D tries to maximize it, i.e. $\arg\min_G\max_D L_{cGAN}(G,D)$; $x$ is the input original image and $y$ is the target image;
in order to ensure the consistency of input and output, an L1 loss is introduced as a constraint, as in formula (2-3):

$$L_{L1}(G) = \mathbb{E}_{x,y}\left[\left\|y - G(x)\right\|_1\right] \tag{2-3}$$
in order to ensure that the generated image does not lose structural information, a measure of structural similarity is introduced as in equation (2-4),
$$L_{MS\text{-}SSIM} = 1 - \mathrm{MS\_SSIM}(G(x), y) \tag{2-4}$$

where

$$\mathrm{MS\_SSIM}(G(x),y) = \left[l_M(G(x),y)\right]^{\alpha_M}\prod_{j=1}^{M}\left[c_j(G(x),y)\right]^{\beta_j}\left[s_j(G(x),y)\right]^{\gamma_j} \tag{2-5}$$

The MS_SSIM method takes the reference image and the original image signal as input, and repeatedly applies a low-pass filter followed by downsampling by a factor of 2. Taking the scale of the original image as 1, the highest scale M is obtained after M-1 iterations. At the j-th scale, the contrast comparison and structural similarity, denoted $c_j(G(x),y)$ and $s_j(G(x),y)$ respectively, are computed by formulas (2-6) and (2-7):

$$c_j(G(x),y) = \frac{2\sigma_G\sigma_y + C_2}{\sigma_G^2 + \sigma_y^2 + C_2} \tag{2-6}$$

$$s_j(G(x),y) = \frac{\sigma_{Gy} + C_3}{\sigma_G\sigma_y + C_3} \tag{2-7}$$

The luminance comparison is computed only at scale M, by formula (2-8):

$$l_M(G(x),y) = \frac{2\mu_G\mu_y + C_1}{\mu_G^2 + \mu_y^2 + C_1} \tag{2-8}$$

where $\mu_G$ and $\mu_y$ are the means of the generated image and the label image respectively, $\sigma_G$ and $\sigma_y$ their standard deviations, and $\sigma_{Gy}$ their covariance; $\alpha_M$, $\beta_j$, $\gamma_j$ weight the importance of the luminance estimate, contrast, and structural-similarity components at the corresponding scale (to simplify computation, set $\alpha_j = \beta_j = \gamma_j$); $C_1$, $C_2$, $C_3$ are constants introduced to avoid instability, with $C_1 = (K_1 L)^2$, $C_2 = (K_2 L)^2$, $C_3 = C_2/2$, taking parameters $K_1 = 0.01$, $K_2 = 0.03$, and $L$ the dynamic range of pixel values;
to further improve the image quality, the gap between frequency domains is introduced as a constraint term, as in formula (2-9):

$$L_{FFL} = \frac{1}{MN}\sum_{u=0}^{M-1}\sum_{v=0}^{N-1} w(u,v)\left|y(u,v) - G(u,v)\right|^2 \tag{2-9}$$

where $M \times N$ is the size of the image, $w(u,v)$ is the spectrum weight matrix, $(u,v)$ are frequency-domain coordinates, and $y(u,v)$ and $G(u,v)$ are the corresponding complex frequency values;
the overall optimization objective of the generator is therefore the following equation (2-10):
$$L_{total} = L_{cGAN} + \alpha L_{L1} + \beta L_{MS\text{-}SSIM} + \gamma L_{FFL} \tag{2-10}$$

where α, β and γ are the weights of the respective loss terms, set according to the needs of actual training; the larger the weight of a constraint term, the greater its influence on the total objective loss. The objective function is then continuously and iteratively optimized by gradient descent to obtain the optimal solution.
The embodiment of the invention also provides a photographing cube, comprising a control processor and a camera in communication with the control processor. The camera is used for acquiring a video stream of the object to be photographed inside the photographing cube and transmitting it to the control processor; the control processor is used for controlling the camera to take a photograph according to the received video stream, processing the captured image, and outputting it. The control processor is configured to execute the automatic photographing control method for the photographing cube described in any of the above embodiments.
Compared with the prior art, the automatic photographing control method and system suitable for the photographing cube, and the photographing cube itself, provided by the embodiments of the invention effectively simplify automatic photographing with the cube: the photographer can trigger the camera to complete the shot through a gesture, without touching any physical button, and the image captured by the camera is processed with portrait skin beautification so that the skin looks more textured, healthy and beautiful. This improves the image quality of the captured pictures and effectively improves the user experience.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings required to be used in the embodiments will be briefly described below, and obviously, the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a schematic flow chart of an automatic photo taking control method suitable for a cube taking according to an embodiment of the present invention.
Fig. 2 is a schematic structural diagram of a gesture detection model involved in an automatic photo-taking control method suitable for a cube taking device according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a generator of a generative confrontation network involved in an automatic photograph control method suitable for a cube.
Fig. 4 is a schematic structural diagram of a feature learning module of a generative confrontation network involved in an automatic photo-taking control method for a cube according to an embodiment of the present invention.
Fig. 5 is a schematic structural diagram of an arbiter of a generative confrontation network involved in an automatic photograph control method for a cube according to an embodiment of the present invention.
Fig. 6 is a schematic structural diagram of an automatic photographing control system suitable for a cube.
Fig. 7 is a schematic structural diagram of a photographing cube according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it is to be understood that the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", and the like, indicate orientations and positional relationships based on those shown in the drawings, and are used only for convenience of description and simplicity of description, and do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be considered as limiting the present invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means at least two, e.g. two, three, etc., unless specifically limited otherwise.
In the present invention, unless otherwise expressly stated or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally formed; may be mechanically, electrically or otherwise in communication with each other; either directly or indirectly through intervening media, either internally or in any other relationship. The specific meanings of the above terms in the present invention can be understood according to specific situations by those of ordinary skill in the art.
Referring to fig. 1, an embodiment of the present invention provides an automatic photographing control method suitable for a photographing cube, where the photographing cube includes a control processor and a camera in communication with the control processor, and the automatic photographing control method includes:
s1, performing gesture detection on the first N frames of images of a video stream acquired by a camera based on a gesture detection model of YOLO to obtain a hand area; n is more than or equal to 3;
s2, modeling the target motion state of the hand area by using Kalman filtering, and tracking and predicting the position of the hand area in the next frame in real time to obtain a gesture of an object to be recognized;
s3, comparing the gesture of the object to be recognized with a reference gesture in a preset gesture template library to calculate similarity; when the calculated similarity is larger than a set threshold value, judging the gesture of the object to be recognized as a photographing triggering gesture, and starting photographing countdown; otherwise, returning to the step S1;
s4, when the photographing countdown is finished, controlling the camera to photograph to obtain an original image;
and S5, performing face beautification processing on the original image by using the constructed generative adversarial network to obtain and output a processed target image.
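As an illustration only, the S1 to S5 flow can be sketched as a control loop. The helper callables (`detect_hand`, `track_hand`, `match_gesture`, `beautify`) and the `camera` interface are hypothetical placeholders for the models described below, not APIs from this disclosure:

```python
import time

def auto_photo_loop(camera, detect_hand, track_hand, match_gesture,
                    beautify, threshold, countdown_s=3, n_init=3):
    """Sketch of steps S1-S5: detect, track, match, count down, shoot, beautify."""
    while True:
        # S1: YOLO-based gesture detection on the first N (>= 3) frames
        frames = [camera.read() for _ in range(n_init)]
        boxes = [detect_hand(f) for f in frames]
        if not all(boxes):
            continue                      # no hand region found, keep looking
        # S2: Kalman tracking predicts the hand box position in the next frame
        gesture_img = track_hand(boxes, camera.read())
        # S3: compare against the reference gestures in the template library
        if match_gesture(gesture_img) <= threshold:
            continue                      # similarity too low: back to S1
        time.sleep(countdown_s)           # photographing countdown
        raw = camera.shoot()              # S4: capture the original image
        return beautify(raw)              # S5: GAN-based face beautification
```

The loop returns only once a gesture passes the similarity threshold, mirroring the "otherwise, return to step S1" branch of S3.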
Specifically, as shown in fig. 2, in step S1, the gesture detection model includes an input unit 21, an encoder 22 and a decoder 23. The input unit 21 is configured to input the first N frames of images of the video stream acquired by the camera; the encoder 22 is configured to perform feature extraction and feature fusion on the first N frames of images to obtain a feature map with richer information; and the decoder 23 decodes the feature map into a detection result, thereby obtaining the hand region.
The feature extraction backbone network comprises MobileNetV2 blocks at the 2x, 4x and 6x stages; feature fusion adopts an FPN to fuse the feature maps; and the decoding part performs YOLO regression and classification.
Specifically, a loss function L of the following formula (1-1) is optimized based on a preset gesture detection training data set, and an optimal detection result is obtained through continuous iterative training by a gradient descent method:
$$
\begin{aligned}
L ={} & \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\Big[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\Big] \\
&+\lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\Big[(\sqrt{w_i}-\sqrt{\hat{w}_i})^2+(\sqrt{h_i}-\sqrt{\hat{h}_i})^2\Big] \\
&+\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}(C_i-\hat{C}_i)^2
+\lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}(C_i-\hat{C}_i)^2 \\
&+\sum_{i=0}^{S^2}\mathbb{1}_{i}^{obj}\sum_{c\in classes}\big(p_i(c)-\hat{p}_i(c)\big)^2 \qquad (1\text{-}1)
\end{aligned}
$$
wherein $S^2$ represents the size of the output-layer feature map, i.e. the number of grid cells the input image is divided into; B is the number of Anchors per grid cell; $\mathbb{1}_{ij}^{obj}$ indicates whether the j-th Anchor in the i-th grid cell is responsible for the object prediction, the Anchor with the largest IoU with the ground-truth box being selected as the prediction box; $\mathbb{1}_{ij}^{noobj}$ indicates that the j-th Anchor in the i-th grid cell is not responsible for the object prediction; $x_i, y_i, w_i, h_i$ and $\hat{x}_i, \hat{y}_i, \hat{w}_i, \hat{h}_i$ respectively represent the center coordinates and the width and height of the ground-truth label and of the prediction box, normalized to between 0 and 1, with $x_i, y_i$ being offsets relative to the grid cell; $C_i$ and $\hat{C}_i$ respectively represent the confidence of the ground-truth box and of the prediction box; $p_i(c)$ and $\hat{p}_i(c)$ represent the true and predicted class probabilities of the corresponding cell; classes represents the number of object categories; $\lambda_{coord}$ and $\lambda_{noobj}$ respectively represent the loss weight of the bounding-box coordinate prediction and the confidence-prediction loss weight for bounding boxes that do not contain an object.
It can be understood that the preset gesture detection training data set contains about 20,000 samples, and the categories include "palm", "fist", "thumb", "one-hand heart", "scissor hand" and "others". In the gesture detection process the type of the gesture is not of concern; mainly the localization function is utilized. The objective function is optimized on this data set, and the optimal solution is obtained by continuous iterative training with the gradient descent method.
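Since the loss above assigns each ground-truth box to the Anchor with the largest IoU, that assignment step can be illustrated in isolation. A minimal sketch, assuming boxes in (center-x, center-y, width, height) format; the function names are illustrative, not from the disclosure:

```python
def iou(a, b):
    """Intersection over union of two boxes given as (cx, cy, w, h)."""
    ax1, ay1, ax2, ay2 = a[0]-a[2]/2, a[1]-a[3]/2, a[0]+a[2]/2, a[1]+a[3]/2
    bx1, by1, bx2, by2 = b[0]-b[2]/2, b[1]-b[3]/2, b[0]+b[2]/2, b[1]+b[3]/2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))   # overlap width
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))   # overlap height
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def responsible_anchor(anchors, gt_box):
    """Index j of the Anchor with the largest IoU against the ground-truth box,
    i.e. the one whose indicator term is active in the localization loss."""
    return max(range(len(anchors)), key=lambda j: iou(anchors[j], gt_box))
```

For example, an Anchor identical to the ground truth (IoU = 1) wins over a half-sized one centered at the same point (IoU = 0.25).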
Further, in step S2, the kalman filtering includes a prediction process and an update process, the prediction process estimates a current state according to a state at a previous time, and the update process corrects prediction information according to an observation value, so as to estimate an optimal state; the method specifically comprises the following formulas (1-2) to (1-6):
Prediction process:
$$\hat{\mu}_t = A\mu_{t-1} + Bu_t \qquad (1\text{-}2)$$
$$\hat{\Sigma}_t = A\Sigma_{t-1}A^{T} + Q \qquad (1\text{-}3)$$
Update process:
$$K_t = \hat{\Sigma}_t H^{T}\big(H\hat{\Sigma}_t H^{T} + R\big)^{-1} \qquad (1\text{-}4)$$
$$\mu_t = \hat{\mu}_t + K_t\big(z_t - H\hat{\mu}_t\big) \qquad (1\text{-}5)$$
$$\Sigma_t = (I - K_t H)\hat{\Sigma}_t \qquad (1\text{-}6)$$
wherein $\hat{\mu}_t$ represents the prediction estimate derived from the previous state; $\mu_t$ represents the optimal estimate after updating $\hat{\mu}_t$ with the observation $z_t$; A represents the state transition matrix; H represents the observation matrix; B represents the input control matrix; $u_t$ represents the external control quantity of the system; Q and R respectively represent the dynamic-noise covariance matrix and the measurement-noise covariance matrix; $K_t$ represents the Kalman gain at time t; $\hat{\Sigma}_t$ represents the prediction-error covariance matrix; $\Sigma_t$ represents the filtering-error covariance matrix;
according to the process, the state of the tracker is initialized by utilizing a hand region bounding box obtained by detecting the first N frames of images of the video stream acquired by the camera, and then the position of the estimated target in the next frame is continuously predicted, so that real-time tracking is realized.
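The prediction and update formulas above can be sketched directly with NumPy. The state layout (e.g. bounding-box center plus velocity) and all matrix values in the usage below are illustrative assumptions, not taken from the disclosure:

```python
import numpy as np

def kalman_predict(mu, Sigma, A, Q, B=None, u=None):
    """Prediction step, formulas (1-2) and (1-3): propagate state and covariance."""
    mu_pred = A @ mu + (B @ u if B is not None else 0)
    Sigma_pred = A @ Sigma @ A.T + Q
    return mu_pred, Sigma_pred

def kalman_update(mu_pred, Sigma_pred, z, H, R):
    """Update step, formulas (1-4) to (1-6): correct the prediction with observation z."""
    S = H @ Sigma_pred @ H.T + R
    K = Sigma_pred @ H.T @ np.linalg.inv(S)              # Kalman gain K_t
    mu = mu_pred + K @ (z - H @ mu_pred)                 # optimal state estimate
    Sigma = (np.eye(len(mu_pred)) - K @ H) @ Sigma_pred  # filtering error covariance
    return mu, Sigma
```

With a constant-velocity model (state = position and velocity, only position observed), repeating predict/update per frame tracks the hand box center as described.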
Further, in step S3, the to-be-recognized picture with the to-be-recognized object gesture and the reference gesture picture in the gesture template library are first scaled to the same size, then the euclidean distance is calculated according to the following formula (1-7), and the reference gesture picture with the minimum euclidean distance is found out as the reference gesture picture with the highest similarity:
$$t^{*} = \arg\min_{t\in T}\ \lVert x - t\rVert_{2} \qquad (1\text{-}7)$$
wherein x is the picture to be recognized, T is the set of reference gesture pictures in the gesture template library, t is an element of the set T, and $t^{*}$ is the element with the minimum Euclidean distance to x;
If the Euclidean distance of the found $t^{*}$ is less than a preset threshold, it is determined that the similarity between $t^{*}$ and the gesture of the object to be recognized is greater than the set threshold, and the gesture of the object to be recognized is therefore determined to be the photographing trigger gesture.
The process of matching the gesture template can be described as solving the distance from a point to each element of a set; the best match is the element of the set with the minimum distance to the point, denoted $t^{*}$.
It can be seen that a template library is first established by collecting gesture pictures of several types taken from a plurality of different angles; the types include "palm", "fist", "thumb", "one-hand heart", "scissor hand" and "other", i.e. 6 classes in total. The picture to be recognized and the template pictures are then scaled to the same size, the Euclidean distance is calculated pixel-wise according to formula (1-7), and the template picture with the highest similarity, i.e. the template with the smallest distance, is found. If the distance is less than the set threshold, the two pictures are very close, and the category of $t^{*}$ is therefore taken as the recognition result of the picture to be detected.
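The template-matching rule of formula (1-7), including the distance threshold check, can be sketched as follows; the category labels and threshold values in the usage are illustrative:

```python
import numpy as np

def match_template(x, templates, labels, threshold):
    """Find t* = argmin_{t in T} ||x - t||_2 (formula (1-7)) and return its
    label when the distance is below the threshold, else None.
    All images are assumed already scaled to the same size."""
    dists = [np.linalg.norm(x.astype(float) - t.astype(float)) for t in templates]
    best = int(np.argmin(dists))          # index of the nearest template t*
    return labels[best] if dists[best] < threshold else None
```

Returning `None` corresponds to the "otherwise, return to step S1" branch: no template is close enough to count as a trigger gesture.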
Further, referring to fig. 3 to 5, in step S5, the constructed generative adversarial network is used to perform face beautification on the original image to obtain and output a processed target image. The face beautification mainly removes facial blemishes and performs skin smoothing, so that the skin looks more textured, healthy and beautiful, improving the image quality of the captured photos.
Specifically, the generative adversarial network includes a generator and a discriminator, and fig. 4 is a detailed diagram of the function F in fig. 3. Considering that the output image G(X) differs from the input image X only by slight changes in characteristics such as texture and color, the generator adopts a residual structure to ensure the quality of the generated image, and learning the residual F(X) makes it easier for the model to converge.
The generator is implemented as shown in the following formula (2-1),
G(X)=F(X)+X (2-1)
wherein X is the input original image and F(X) is the output obtained after the original image X is processed by the feature learning module F. The feature learning module F mainly adopts a sparsely connected parallel structure to extract features: the input original image X is first reduced to 1/4 of its original size by convolution and pooling; the resolution is then kept unchanged on the main path, while the resolution on the branch is 1/2 of that of the main branch and learns high-level semantic features; finally, the high-level semantic features are mixed with the main branch, which is rich in structural information, and output.
The inputs of the discriminator D are X and G(X); after features are extracted through multilayer convolution, a feature map with a large receptive field and a small resolution is obtained, and the average value of this feature map is used as the score of the discriminator.
In the whole model training process, D adopts binary cross entropy as the loss function. To better generate the required target image, the loss objective of the generator adds three regularization terms on the basis of the GAN loss: an L1 loss measuring numerical distance, a multi-scale structural similarity (MS-SSIM) perceptual loss, and a Focal Frequency Loss (FFL) measuring frequency-domain similarity, as follows:
the target loss function of the conditional GAN can be expressed as shown in equation (2-2),
L cGAN (G,D)=E x,y [logD(x,y)]+E x [log(1-D(x,G(x)))] (2-2)
where G attempts to minimize this objective while D attempts to maximize it, i.e. $G^{*}=\arg\min_G\max_D L_{cGAN}(G,D)$; x is the input original image and y is the target image;
in order to ensure the consistency of input and output, L1 loss is introduced to carry out constraint such as formula (2-3),
L L1 (G)=E x,y [||y-G(x)|| 1 ] (2-3)
in order to prevent the generated image from losing the structural information, a measure of structural similarity is introduced as in equation (2-4),
L MS-SSIM =1-MS_SSIM(G(x),y) (2-4)
wherein
$$MS\_SSIM(G(x),y)=\big[l_M(G(x),y)\big]^{\alpha_M}\prod_{j=1}^{M}\big[c_j(G(x),y)\big]^{\beta_j}\big[s_j(G(x),y)\big]^{\gamma_j} \qquad (2\text{-}5)$$
The MS_SSIM method takes the reference image and the original image signal as input and repeatedly applies a low-pass filter and downsamples the image by a factor of 2; assuming the original image scale is 1, the highest scale M is obtained after M-1 iterations. At the j-th scale, the contrast comparison and the structural similarity, denoted $c_j(G(x),y)$ and $s_j(G(x),y)$ respectively, are calculated by the following formulas (2-6) and (2-7):
$$c_j(G(x),y)=\frac{2\sigma_G\sigma_y+C_2}{\sigma_G^{2}+\sigma_y^{2}+C_2} \qquad (2\text{-}6)$$
$$s_j(G(x),y)=\frac{\sigma_{Gy}+C_3}{\sigma_G\sigma_y+C_3} \qquad (2\text{-}7)$$
The luminance comparison is calculated by formula (2-8) only at the scale M:
$$l_M(G(x),y)=\frac{2\mu_G\mu_y+C_1}{\mu_G^{2}+\mu_y^{2}+C_1} \qquad (2\text{-}8)$$
wherein $\mu_G$ and $\mu_y$ represent the means of the generated image and the label image respectively, $\sigma_G$ and $\sigma_y$ the standard deviations, and $\sigma_{Gy}$ the covariance; $\alpha_M$, $\beta_j$ and $\gamma_j$ weight the importance of the luminance, contrast and structural-similarity components at the corresponding scales; to simplify the computation, let $\alpha_j=\beta_j=\gamma_j$; C1, C2 and C3 are constants that avoid instability, with $C_1=(K_1L)^{2}$, $C_2=(K_2L)^{2}$, $C_3=C_2/2$, taking the parameters $K_1=0.01$, $K_2=0.03$, where L is the dynamic range of the pixel values;
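The luminance, contrast and structure terms of formulas (2-6) to (2-8) can be sketched as below. Note this is a simplification: real (MS-)SSIM computes the statistics over local windows and multiple scales, whereas this sketch uses whole-image statistics at a single scale for clarity:

```python
import numpy as np

def ssim_components(x, y, L=255.0, K1=0.01, K2=0.03):
    """Global luminance l, contrast c and structure s terms of SSIM,
    following formulas (2-6) to (2-8) with C1=(K1*L)^2, C2=(K2*L)^2, C3=C2/2."""
    C1, C2 = (K1 * L) ** 2, (K2 * L) ** 2
    C3 = C2 / 2
    mx, my = x.mean(), y.mean()                       # means mu
    sx, sy = x.std(), y.std()                         # standard deviations sigma
    sxy = ((x - mx) * (y - my)).mean()                # covariance sigma_xy
    l = (2 * mx * my + C1) / (mx ** 2 + my ** 2 + C1)
    c = (2 * sx * sy + C2) / (sx ** 2 + sy ** 2 + C2)
    s = (sxy + C3) / (sx * sy + C3)
    return l, c, s
```

For identical images all three terms equal 1, so the MS-SSIM loss $1-MS\_SSIM$ vanishes, which is the behavior the structural constraint relies on.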
To further improve the image quality, the gap between the frequency domains is introduced as a constraint term, as in formula (2-9):
$$L_{FFL}=\frac{1}{MN}\sum_{u=0}^{M-1}\sum_{v=0}^{N-1} w(u,v)\,\big|F_y(u,v)-F_G(u,v)\big|^{2} \qquad (2\text{-}9)$$
wherein M×N represents the size of the image, w(u,v) is the spectrum weight matrix, (u,v) are the frequency-domain coordinates, and $F_y(u,v)$ and $F_G(u,v)$ are the complex frequency values of the target image y and the generated image G(x);
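A minimal sketch of the frequency-domain constraint of formula (2-9). It assumes, as in the dynamic variant of the Focal Frequency Loss, that the spectrum weight w(u,v) is the magnitude of the per-frequency error itself; that choice is an assumption of this sketch, not specified above:

```python
import numpy as np

def focal_frequency_loss(gen, target):
    """Average spectrum-weighted squared distance between the 2-D FFTs of the
    generated and target images, following formula (2-9)."""
    Fg = np.fft.fft2(gen)                 # complex frequency values F_G(u, v)
    Fy = np.fft.fft2(target)              # complex frequency values F_y(u, v)
    diff = np.abs(Fy - Fg) ** 2           # per-frequency squared error
    w = np.sqrt(diff)                     # dynamic spectrum weight matrix w(u, v)
    return (w * diff).mean()              # mean over the M x N frequency grid
```

The weighting emphasizes the frequencies where the generated image deviates most, pushing the generator to recover hard frequency components first.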
the overall optimization objective of the generator is therefore the following equation (2-10):
L total =L cGAN +αL L1 +βL MS-SSIM +γL FFL (2-10)
wherein α, β and γ represent the weights of the respective loss terms and are set according to the needs of actual training; the larger the weight of a constraint term, the larger its influence on the total target loss. The objective function is then continuously and iteratively optimized by the gradient descent method to obtain the optimal solution.
Referring to fig. 6, an embodiment of the present invention provides an automatic photographing control system suitable for a photographing cube, where the photographing cube includes a control processor and a camera in communication with the control processor, and the automatic photographing control system includes:
the gesture detection module 61 is configured to perform gesture detection on the first N frames of images of the video stream acquired by the camera based on a gesture detection model of the YOLO to obtain a hand region; n is more than or equal to 3;
the hand tracking module 62 is used for modeling the target motion state of the hand region by using kalman filtering, tracking and predicting the position of the hand region in the next frame in real time, and obtaining the gesture of the object to be recognized;
the gesture recognition module 63 is configured to compare the gesture of the object to be recognized with a reference gesture in a preset gesture template library to calculate a similarity; when the calculated similarity is larger than a set threshold value, judging the gesture of the object to be recognized as a photographing triggering gesture, and starting photographing countdown; otherwise, returning to the step S1;
the photographing control module 64 is configured to control the camera to photograph when the photographing countdown is finished, so as to obtain an original image;
and the image beautification processing module 65 is configured to perform face beautification processing on the original image by using the constructed generative adversarial network to obtain and output a processed target image.
Specifically, in the gesture detection module 61, as shown in fig. 2, the gesture detection model includes an input unit 21, an encoder 22 and a decoder 23, the input unit 21 is configured to input the first N frames of images of the video stream acquired by the camera, and the encoder 22 is configured to perform feature extraction and feature fusion on the first N frames of images to obtain a feature map with richer information; the decoder 23 decodes the feature map into a detection result, thereby obtaining a hand region;
optimizing a loss function L of the following formula (1-1) based on a preset gesture detection training data set, and continuously performing iterative training by a gradient descent method to obtain an optimal detection result:
$$
\begin{aligned}
L ={} & \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\Big[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\Big] \\
&+\lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\Big[(\sqrt{w_i}-\sqrt{\hat{w}_i})^2+(\sqrt{h_i}-\sqrt{\hat{h}_i})^2\Big] \\
&+\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}(C_i-\hat{C}_i)^2
+\lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}(C_i-\hat{C}_i)^2 \\
&+\sum_{i=0}^{S^2}\mathbb{1}_{i}^{obj}\sum_{c\in classes}\big(p_i(c)-\hat{p}_i(c)\big)^2 \qquad (1\text{-}1)
\end{aligned}
$$
wherein $S^2$ represents the size of the output-layer feature map, i.e. the number of grid cells the input image is divided into; B is the number of Anchors per grid cell; $\mathbb{1}_{ij}^{obj}$ indicates whether the j-th Anchor in the i-th grid cell is responsible for the object prediction, the Anchor with the largest IoU with the ground-truth box being selected as the prediction box; $\mathbb{1}_{ij}^{noobj}$ indicates that the j-th Anchor in the i-th grid cell is not responsible for the object prediction; $x_i, y_i, w_i, h_i$ and $\hat{x}_i, \hat{y}_i, \hat{w}_i, \hat{h}_i$ respectively represent the center coordinates and the width and height of the ground-truth label and of the prediction box, normalized to between 0 and 1, with $x_i, y_i$ being offsets relative to the grid cell; $C_i$ and $\hat{C}_i$ respectively represent the confidence of the ground-truth box and of the prediction box; $p_i(c)$ and $\hat{p}_i(c)$ represent the true and predicted class probabilities of the corresponding cell; classes represents the number of object categories; $\lambda_{coord}$ and $\lambda_{noobj}$ respectively represent the loss weight of the bounding-box coordinate prediction and the confidence-prediction loss weight for bounding boxes that do not contain an object;
Specifically, in the hand tracking module 62, the Kalman filtering includes a prediction process and an update process: the prediction process estimates the current state from the state at the previous time, and the update process corrects the prediction information according to the observation value, thereby estimating the optimal state; specifically, as in the following formulas (1-2) to (1-6):
Prediction process:
$$\hat{\mu}_t = A\mu_{t-1} + Bu_t \qquad (1\text{-}2)$$
$$\hat{\Sigma}_t = A\Sigma_{t-1}A^{T} + Q \qquad (1\text{-}3)$$
Update process:
$$K_t = \hat{\Sigma}_t H^{T}\big(H\hat{\Sigma}_t H^{T} + R\big)^{-1} \qquad (1\text{-}4)$$
$$\mu_t = \hat{\mu}_t + K_t\big(z_t - H\hat{\mu}_t\big) \qquad (1\text{-}5)$$
$$\Sigma_t = (I - K_t H)\hat{\Sigma}_t \qquad (1\text{-}6)$$
wherein $\hat{\mu}_t$ represents the prediction estimate derived from the previous state; $\mu_t$ represents the optimal estimate after updating $\hat{\mu}_t$ with the observation $z_t$; A represents the state transition matrix; H represents the observation matrix; B represents the input control matrix; $u_t$ represents the external control quantity of the system; Q and R respectively represent the dynamic-noise covariance matrix and the measurement-noise covariance matrix; $K_t$ represents the Kalman gain at time t; $\hat{\Sigma}_t$ represents the prediction-error covariance matrix; $\Sigma_t$ represents the filtering-error covariance matrix;
according to the process, initializing the state of the tracker by using a hand region bounding box obtained by detecting the first N frames of images of the video stream acquired by the camera, and then continuously predicting the position of the estimated target in the next frame to realize real-time tracking;
further, in the gesture recognition module 63, first, the to-be-recognized picture with the gesture of the to-be-recognized object and the reference gesture picture in the gesture template library are scaled to the same size, then, the euclidean distance is calculated according to the following formula (1-7), and the reference gesture picture with the minimum euclidean distance is found out to be the reference gesture picture with the highest similarity:
$$t^{*} = \arg\min_{t\in T}\ \lVert x - t\rVert_{2} \qquad (1\text{-}7)$$
wherein x is the picture to be recognized, T is the set of reference gesture pictures in the gesture template library, t is an element of the set T, and $t^{*}$ is the element with the minimum Euclidean distance to x;
If the Euclidean distance of the found $t^{*}$ is less than a preset threshold, it is determined that the similarity between $t^{*}$ and the gesture of the object to be recognized is greater than the set threshold, and the gesture of the object to be recognized is therefore determined to be the photographing trigger gesture.
Further, in the image beautification processing module 65, the generative adversarial network includes a generator and a discriminator, the generator being implemented as shown in the following formula (2-1),
G(X)=F(X)+X (2-1)
wherein X is the input original image and F(X) is the output obtained after the original image X is processed by the feature learning module F. The feature learning module F mainly adopts a sparsely connected parallel structure to extract features: the input original image X is first reduced to 1/4 of its original size by convolution and pooling; the resolution is then kept unchanged on the main path, while the resolution on the branch is 1/2 of that of the main branch and learns high-level semantic features; finally, the high-level semantic features are mixed with the main branch, which is rich in structural information, and output;
the inputs of the discriminator D are X and G(X); after features are extracted through multilayer convolution, a feature map with a large receptive field and a small resolution is obtained, and the average value of this feature map is used as the score of the discriminator;
in the whole model training process, D adopts binary cross entropy as the loss function. To better generate the required target image, the loss objective of the generator adds three regularization terms on the basis of the GAN loss: an L1 loss measuring numerical distance, a multi-scale structural similarity (MS-SSIM) perceptual loss, and a Focal Frequency Loss (FFL) measuring frequency-domain similarity, as follows:
the target loss function of the conditional GAN can be expressed as shown in equation (2-2),
L cGAN (G,D)=E x,y [logD(x,y)]+E x [log(1-D(x,G(x)))] (2-2)
where G attempts to minimize this objective while D attempts to maximize it, i.e. $G^{*}=\arg\min_G\max_D L_{cGAN}(G,D)$; x is the input original image and y is the target image;
in order to ensure the consistency of input and output, L1 loss is introduced to carry out constraint such as formula (2-3),
L L1 (G)=E x,y [||y-G(x)|| 1 ] (2-3)
in order to prevent the generated image from losing the structural information, a measure of structural similarity is introduced as in equation (2-4),
L MS-SSIM =1-MS_SSIM(G(x),y) (2-4)
wherein
$$MS\_SSIM(G(x),y)=\big[l_M(G(x),y)\big]^{\alpha_M}\prod_{j=1}^{M}\big[c_j(G(x),y)\big]^{\beta_j}\big[s_j(G(x),y)\big]^{\gamma_j} \qquad (2\text{-}5)$$
The MS_SSIM method takes the reference image and the original image signal as input and repeatedly applies a low-pass filter and downsamples the image by a factor of 2; assuming the original image scale is 1, the highest scale M is obtained after M-1 iterations. At the j-th scale, the contrast comparison and the structural similarity, denoted $c_j(G(x),y)$ and $s_j(G(x),y)$ respectively, are calculated by the following formulas (2-6) and (2-7):
$$c_j(G(x),y)=\frac{2\sigma_G\sigma_y+C_2}{\sigma_G^{2}+\sigma_y^{2}+C_2} \qquad (2\text{-}6)$$
$$s_j(G(x),y)=\frac{\sigma_{Gy}+C_3}{\sigma_G\sigma_y+C_3} \qquad (2\text{-}7)$$
The luminance comparison is calculated by formula (2-8) only at the scale M:
$$l_M(G(x),y)=\frac{2\mu_G\mu_y+C_1}{\mu_G^{2}+\mu_y^{2}+C_1} \qquad (2\text{-}8)$$
wherein $\mu_G$ and $\mu_y$ represent the means of the generated image and the label image respectively, $\sigma_G$ and $\sigma_y$ the standard deviations, and $\sigma_{Gy}$ the covariance; $\alpha_M$, $\beta_j$ and $\gamma_j$ weight the importance of the luminance, contrast and structural-similarity components at the corresponding scales; to simplify the computation, let $\alpha_j=\beta_j=\gamma_j$; C1, C2 and C3 are constants that avoid instability, with $C_1=(K_1L)^{2}$, $C_2=(K_2L)^{2}$, $C_3=C_2/2$, taking the parameters $K_1=0.01$, $K_2=0.03$, where L is the dynamic range of the pixel values;
To further improve the image quality, the gap between the frequency domains is introduced as a constraint term, as in formula (2-9):
$$L_{FFL}=\frac{1}{MN}\sum_{u=0}^{M-1}\sum_{v=0}^{N-1} w(u,v)\,\big|F_y(u,v)-F_G(u,v)\big|^{2} \qquad (2\text{-}9)$$
wherein M×N represents the size of the image, w(u,v) is the spectrum weight matrix, (u,v) are the frequency-domain coordinates, and $F_y(u,v)$ and $F_G(u,v)$ are the complex frequency values of the target image y and the generated image G(x);
the overall optimization objective of the generator is therefore the following equation (2-10):
L total =L cGAN +αL L1 +βL MS-SSIM +γL FFL (2-10)
wherein α, β and γ represent the weights of the respective loss terms and are set according to the needs of actual training; the larger the weight of a constraint term, the larger its influence on the total target loss. The objective function is then continuously and iteratively optimized by the gradient descent method to obtain the optimal solution.
Referring to fig. 7, an embodiment of the present invention provides a photographing cube, where the photographing cube includes a control processor 1, and a camera 2 and a camera 3 that are respectively in communication with the control processor 1; the camera 2 is configured to obtain a video stream of an object to be photographed in the photographing cube and transmit the video stream to the control processor 1; the control processor 1 is configured to control the camera 3 to take a photograph according to the received video stream and to process and output the captured image; wherein the control processor 1 is arranged to be able to execute the automatic photographing control method suitable for a photographing cube described in any of the above embodiments.
To sum up, the automatic photographing control method and system suitable for a photographing cube, and the photographing cube itself, can effectively simplify automatic photographing with the cube: a photographer can automatically trigger the camera to complete photographing through gestures without touching any physical button, and the images captured by the camera are processed with facial-skin beautification so that the skin looks more textured, healthy and beautiful, improving the image quality of the captured photos and effectively improving the user experience.
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (7)
1. An automatic photographing control method suitable for a photographing cube, wherein the photographing cube comprises a control processor and a camera in communication with the control processor, the automatic photographing control method comprising the following steps:
s1, performing gesture detection on the first N frames of images of a video stream acquired by a camera based on a gesture detection model of YOLO to obtain a hand area; n is more than or equal to 3;
s2, modeling the target motion state of the hand area by using Kalman filtering, and tracking and predicting the position of the hand area in the next frame in real time to obtain a gesture of an object to be recognized;
s3, comparing the gesture of the object to be recognized with a reference gesture in a preset gesture template library to calculate similarity; when the calculated similarity is larger than a set threshold value, judging the gesture of the object to be recognized as a photographing triggering gesture, and starting photographing countdown; otherwise, returning to the step S1;
s4, when the photographing countdown is finished, controlling the camera to photograph to obtain an original image;
s5, performing face beautification processing on the original image by using the constructed generative adversarial network to obtain and output a processed target image;
in step S5, the generative adversarial network includes a generator and a discriminator, the generator being implemented as shown in the following formula (2-1),
G(X)=F(X)+X (2-1)
wherein X is the input original image and F(X) is the output obtained after the original image X is processed by the feature learning module F. The feature learning module F mainly adopts a sparsely connected parallel structure to extract features: the input original image X is first reduced to 1/4 of its original size by convolution and pooling; the resolution is then kept unchanged on the main path, while the resolution on the branch is 1/2 of that of the main branch and learns high-level semantic features; finally, the high-level semantic features are mixed with the main branch, which is rich in structural information, and output;
the inputs of the discriminator D are X and G(X); after features are extracted through multilayer convolution, a feature map with a large receptive field and a small resolution is obtained, and the average value of this feature map is used as the score of the discriminator;
in the whole model training process, D adopts binary cross entropy as the loss function. To better generate the required target image, the loss objective of the generator adds three regularization terms on the basis of the GAN loss: an L1 loss measuring numerical distance, a multi-scale structural similarity (MS-SSIM) perceptual loss, and a Focal Frequency Loss (FFL) measuring frequency-domain similarity, as follows:
the target loss function of the conditional GAN can be expressed as shown in equation (2-2),
L cGAN (G,D)=E x,y [logD(x,y)]+E x [log(1-D(x,G(x)))] (2-2)
where G attempts to minimize this objective while D attempts to maximize it, i.e. $G^{*}=\arg\min_G\max_D L_{cGAN}(G,D)$; x is the input original image and y is the target image;
in order to ensure the consistency of input and output, L1 loss is introduced to carry out constraint such as formula (2-3),
L L1 (G)=E x,y [||y-G(x)|| 1 ] (2-3)
in order to prevent the generated image from losing the structural information, a measure of structural similarity is introduced as in equation (2-4),
L MS-SSIM =1-MS_SSIM(G(x),y) (2-4)
wherein
$$MS\_SSIM(G(x),y)=\big[l_M(G(x),y)\big]^{\alpha_M}\prod_{j=1}^{M}\big[c_j(G(x),y)\big]^{\beta_j}\big[s_j(G(x),y)\big]^{\gamma_j} \qquad (2\text{-}5)$$
The MS_SSIM method takes the reference image and the original image signal as input and repeatedly applies a low-pass filter and downsamples the image by a factor of 2; assuming the original image scale is 1, the highest scale M is obtained after M-1 iterations. At the j-th scale, the contrast comparison and the structural similarity, denoted $c_j(G(x),y)$ and $s_j(G(x),y)$ respectively, are calculated by the following formulas (2-6) and (2-7):
$$c_j(G(x),y)=\frac{2\sigma_G\sigma_y+C_2}{\sigma_G^{2}+\sigma_y^{2}+C_2} \qquad (2\text{-}6)$$
$$s_j(G(x),y)=\frac{\sigma_{Gy}+C_3}{\sigma_G\sigma_y+C_3} \qquad (2\text{-}7)$$
The luminance comparison is calculated by formula (2-8) only at the scale M:
$$l_M(G(x),y)=\frac{2\mu_G\mu_y+C_1}{\mu_G^{2}+\mu_y^{2}+C_1} \qquad (2\text{-}8)$$
wherein $\mu_G$ and $\mu_y$ represent the means of the generated image and the label image respectively, $\sigma_G$ and $\sigma_y$ the standard deviations, and $\sigma_{Gy}$ the covariance; $\alpha_M$, $\beta_j$ and $\gamma_j$ weight the importance of the luminance, contrast and structural-similarity components at the corresponding scales; to simplify the computation, let $\alpha_j=\beta_j=\gamma_j$; C1, C2 and C3 are constants that avoid instability, with $C_1=(K_1L)^{2}$, $C_2=(K_2L)^{2}$, $C_3=C_2/2$, taking the parameters $K_1=0.01$, $K_2=0.03$, where L is the dynamic range of the pixel values;
To further improve the image quality, the gap between the frequency domains is introduced as a constraint term, as in formula (2-9):
$$L_{FFL}=\frac{1}{MN}\sum_{u=0}^{M-1}\sum_{v=0}^{N-1} w(u,v)\,\big|F_y(u,v)-F_G(u,v)\big|^{2} \qquad (2\text{-}9)$$
wherein M×N represents the size of the image, w(u,v) is the spectrum weight matrix, (u,v) are the frequency-domain coordinates, and $F_y(u,v)$ and $F_G(u,v)$ are the complex frequency values of the target image y and the generated image G(x);
the overall optimization objective of the generator is therefore the following equation (2-10):
L total =L cGAN +αL L1 +βL MS-SSIM +γL FFL (2-10)
wherein α, β and γ represent the weights of the respective loss terms and are set according to the needs of actual training; the larger the weight of a constraint term, the larger its influence on the total target loss. The objective function is then continuously and iteratively optimized by the gradient descent method to obtain the optimal solution.
2. The automatic photography control method for a cube according to claim 1,
in the step S1, the gesture detection model includes an input unit, an encoder and a decoder, the input unit is configured to input the first N frames of images of the video stream acquired by the camera, and the encoder is configured to perform feature extraction and feature fusion on the first N frames of images to obtain a feature map with richer information; the decoder decodes the feature map into a detection result, thereby obtaining a hand region;
optimizing a loss function L of the following formula (1-1) based on a preset gesture detection training data set, and continuously performing iterative training by a gradient descent method to obtain an optimal detection result:
$$
\begin{aligned}
L ={} & \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\Big[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\Big] \\
&+\lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\Big[(\sqrt{w_i}-\sqrt{\hat{w}_i})^2+(\sqrt{h_i}-\sqrt{\hat{h}_i})^2\Big] \\
&+\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}(C_i-\hat{C}_i)^2
+\lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}(C_i-\hat{C}_i)^2 \\
&+\sum_{i=0}^{S^2}\mathbb{1}_{i}^{obj}\sum_{c\in classes}\big(p_i(c)-\hat{p}_i(c)\big)^2 \qquad (1\text{-}1)
\end{aligned}
$$
wherein $S^2$ represents the size of the output-layer feature map, i.e. the number of grid cells the input image is divided into; B is the number of Anchors per grid cell; $\mathbb{1}_{ij}^{obj}$ indicates whether the j-th Anchor in the i-th grid cell is responsible for the object prediction, the Anchor with the largest IoU with the ground-truth box being selected as the prediction box; $\mathbb{1}_{ij}^{noobj}$ indicates that the j-th Anchor in the i-th grid cell is not responsible for the object prediction; $x_i, y_i, w_i, h_i$ and $\hat{x}_i, \hat{y}_i, \hat{w}_i, \hat{h}_i$ respectively represent the center coordinates and the width and height of the ground-truth label and of the prediction box, normalized to between 0 and 1, with $x_i, y_i$ being offsets relative to the grid cell; $C_i$ and $\hat{C}_i$ respectively represent the confidence of the ground-truth box and of the prediction box; $p_i(c)$ and $\hat{p}_i(c)$ represent the true and predicted class probabilities of the corresponding cell; classes represents the number of object categories; $\lambda_{coord}$ and $\lambda_{noobj}$ respectively represent the loss weight of the bounding-box coordinate prediction and the confidence-prediction loss weight for bounding boxes that do not contain an object.
3. The automatic photographing control method for a photographing cube according to claim 1, wherein
in the step S2, the kalman filtering includes a prediction process and an update process, the prediction process estimates a current state according to a state at a previous time, and the update process corrects prediction information according to an observation value, so as to estimate an optimal state; the method specifically comprises the following formulas (1-2) to (1-6):
prediction process:

$$\hat{x}_{t}^{-} = A\hat{x}_{t-1} + Bu_{t} \tag{1-2}$$

$$\Sigma_{t}^{-} = A\Sigma_{t-1}A^{T} + Q \tag{1-3}$$

update process:

$$K_{t} = \Sigma_{t}^{-}H^{T}\left(H\Sigma_{t}^{-}H^{T} + R\right)^{-1} \tag{1-4}$$

$$\hat{x}_{t} = \hat{x}_{t}^{-} + K_{t}\left(z_{t} - H\hat{x}_{t}^{-}\right) \tag{1-5}$$

$$\Sigma_{t} = \left(I - K_{t}H\right)\Sigma_{t}^{-} \tag{1-6}$$

wherein $\hat{x}_{t}^{-}$ represents the prediction estimate derived from the previous state; $\hat{x}_{t}$ represents the optimal estimate obtained by updating $\hat{x}_{t}^{-}$; A represents the state transition matrix; H represents the observation matrix; B represents the input control matrix; $u_{t}$ represents the external control quantity of the system; $z_{t}$ represents the observation at time t; Q and R respectively represent the dynamic noise covariance matrix and the measurement noise covariance matrix; $K_{t}$ represents the Kalman gain at time t; $\Sigma_{t}^{-}$ represents the prediction error covariance matrix; $\Sigma_{t}$ represents the filtering error covariance matrix;
according to the process, the state of the tracker is initialized by utilizing a hand region bounding box obtained by detecting the first N frames of images of the video stream acquired by the camera, and then the position of the estimated target in the next frame is continuously predicted, so that real-time tracking is realized.
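The prediction and update steps described above follow the standard Kalman recursion; a minimal NumPy sketch is given below. The function names are illustrative, and the constant-velocity 1-D state model used in the example is an assumption, not the tracker state of the patent:

```python
import numpy as np

def kalman_predict(x, P, A, B, u, Q):
    # Project the state and its covariance forward one step.
    x_pred = A @ x + B @ u
    P_pred = A @ P @ A.T + Q
    return x_pred, P_pred

def kalman_update(x_pred, P_pred, z, H, R):
    # Fold the observation z into the prediction via the Kalman gain.
    S = H @ P_pred @ H.T + R                      # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)           # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)         # corrected state
    P_new = (np.eye(len(x_pred)) - K @ H) @ P_pred
    return x_new, P_new
```

In a tracker, the state would be initialised from the detected hand bounding box and the predict step run once per frame, with the update applied whenever a fresh detection is available.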
4. The automatic photographing control method for a photographing cube according to claim 1, wherein
in the step S3, the picture to be recognized containing the gesture of the object to be recognized and the reference gesture pictures in the gesture template library are first scaled to the same size, then the Euclidean distance is calculated according to the following formula (1-7), and the reference gesture picture with the minimum Euclidean distance is found as the reference gesture picture with the highest similarity:

$$t^{*} = \arg\min_{t \in T}\left\| x - t \right\|_{2} \tag{1-7}$$

wherein $x$ represents the picture to be recognized, $T$ represents the set of reference gesture pictures in the gesture template library, $t$ represents an element of the set $T$, and $t^{*}$ represents the element with the minimum Euclidean distance;
if the Euclidean distance of the found $t^{*}$ is less than the preset threshold value, it is determined that the similarity between $t^{*}$ and the picture to be recognized is greater than the set threshold value, so that the gesture of the object to be recognized is judged to be the photographing-trigger gesture.
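A minimal sketch of this template-matching step, assuming flattened, equally sized images; the function name, the dictionary of templates and the gesture names are illustrative:

```python
import numpy as np

def match_gesture(x, templates, threshold):
    """x: flattened query image; templates: dict name -> flattened reference
    image, already resized to the same shape as x. Returns (name, distance)
    for the closest template if it is within threshold, else (None, distance)."""
    dists = {name: np.linalg.norm(x - t) for name, t in templates.items()}
    best = min(dists, key=dists.get)          # template with minimum distance
    if dists[best] < threshold:
        return best, dists[best]
    return None, dists[best]
```

Returning `None` when the best distance exceeds the threshold corresponds to the claim's fall-through back to step S1 when no photographing-trigger gesture is recognized.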
5. An automatic photographing control system suitable for a photographing cube, the photographing cube comprising a control processor and, respectively in communication with the control processor, a video camera and a photographing camera, characterized in that the automatic photographing control system comprises:
the gesture detection module is used for performing gesture detection on the first N frames of images of the video stream acquired by the camera based on a gesture detection model of the YOLO to obtain a hand area; n is more than or equal to 3;
the hand tracking module is used for modeling the target motion state of the hand area by using Kalman filtering, tracking and predicting the position of the hand area in the next frame in real time and obtaining the gesture of the object to be recognized;
the gesture recognition module is used for comparing the gesture of the object to be recognized with a reference gesture in a preset gesture template library so as to calculate the similarity; when the calculated similarity is larger than a set threshold value, judging the gesture of the object to be recognized as a photographing triggering gesture, and starting photographing countdown; otherwise, returning to the step S1;
the shooting control module is used for controlling the camera to shoot when the shooting countdown is finished so as to obtain an original image;
the image beautification processing module is used for performing face beautification processing on the original image by using the constructed generative adversarial network to obtain a processed target image and outputting the processed target image;
in the image beautification processing module, the generative adversarial network comprises a generator and a discriminator, the generator being implemented as the following formula (2-1),
G(X)=F(X)+X (2-1)
wherein X is the input original image and F(X) is the output obtained after the original image X is processed by a feature learning module F; the feature learning module F mainly adopts a sparsely connected parallel structure to extract features: the input original image X is first reduced to 1/4 of its original size by convolution and pooling, then the resolution is kept unchanged on the main path while the resolution on the branch is 1/2 of that of the main branch so as to learn high-level semantic features, and finally the high-level semantic features are fused with the main branch, which is rich in structural information, and output;
the inputs of the discriminator D are X and G(X); after features are extracted through multilayer convolution, a feature map with a large receptive field and a small resolution is obtained, and the average value of the feature map is used as the score of the discriminator;
in the whole model training process, D adopts binary cross entropy as the loss function; in order to better generate the required target image, the loss target of the generator adds three regularization terms on the basis of the GAN loss: an L1 loss measuring numerical distance, a perceptual loss measuring multi-level structural similarity, and a focal frequency loss measuring frequency-domain similarity, specifically as follows:
the target loss function of the conditional GAN can be expressed as shown in formula (2-2),

$$L_{cGAN}(G,D)=\mathbb{E}_{x,y}\left[\log D(x,y)\right]+\mathbb{E}_{x}\left[\log\left(1-D(x,G(x))\right)\right] \tag{2-2}$$

where G tries to minimize this objective while D tries to maximize it, i.e. $G^{*}=\arg\min_{G}\max_{D}L_{cGAN}(G,D)$; x is the input original image and y is the target image;
in order to ensure the consistency of input and output, an L1 loss is introduced as a constraint, as in formula (2-3),

$$L_{L1}(G)=\mathbb{E}_{x,y}\left[\left\|y-G(x)\right\|_{1}\right] \tag{2-3}$$
in order to ensure that the generated image does not lose structural information, a measure of structural similarity is introduced as in equation (2-4),
$$L_{MS\text{-}SSIM}=1-\mathrm{MS\_SSIM}(G(x),y) \tag{2-4}$$
wherein

$$\mathrm{MS\_SSIM}(G(x),y)=\left[l_{M}(G(x),y)\right]^{\alpha_{M}}\prod_{j=1}^{M}\left[c_{j}(G(x),y)\right]^{\beta_{j}}\left[s_{j}(G(x),y)\right]^{\gamma_{j}} \tag{2-5}$$

the MS_SSIM method takes the reference image and the original image signal as input and iteratively applies a low-pass filter to down-sample the images by a factor of 2; assuming the scale of the original image is 1, the highest scale M is obtained through M-1 iterations; on the jth scale, the contrast comparison and the structural similarity, denoted $c_{j}(G(x),y)$ and $s_{j}(G(x),y)$ respectively, are calculated by the following formulas (2-6) and (2-7):

$$c_{j}(G(x),y)=\frac{2\sigma_{G}\sigma_{y}+C_{2}}{\sigma_{G}^{2}+\sigma_{y}^{2}+C_{2}} \tag{2-6}$$

$$s_{j}(G(x),y)=\frac{\sigma_{Gy}+C_{3}}{\sigma_{G}\sigma_{y}+C_{3}} \tag{2-7}$$
the luminance comparison is calculated only at the scale M, by formula (2-8):

$$l_{M}(G(x),y)=\frac{2\mu_{G}\mu_{y}+C_{1}}{\mu_{G}^{2}+\mu_{y}^{2}+C_{1}} \tag{2-8}$$

wherein $\mu_{G}$ and $\mu_{y}$ represent the means of the generated image and the label image respectively, and $\sigma_{G}$ and $\sigma_{y}$ represent the standard deviations, $\sigma_{Gy}$ being the covariance; $\alpha_{M}$, $\beta_{j}$, $\gamma_{j}$ represent the importance of the luminance estimation, contrast and structural-similarity components on the corresponding scale; to simplify the operation, let $\alpha_{j}=\beta_{j}=\gamma_{j}$; $C_{1}$, $C_{2}$, $C_{3}$ are constants introduced to avoid instability, with $C_{1}=(K_{1}L)^{2}$, $C_{2}=(K_{2}L)^{2}$, $C_{3}=C_{2}/2$, taking the parameters $K_{1}=0.01$, $K_{2}=0.03$, and L being the dynamic range of the pixel values;
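For a single scale, the contrast, structure and luminance comparisons of formulas (2-6) to (2-8) can be sketched with global image statistics (full MS-SSIM additionally uses local windowed statistics and multi-scale down-sampling, which this simplified sketch omits):

```python
import numpy as np

def ssim_components(a, b, C1=(0.01 * 255) ** 2, C2=(0.03 * 255) ** 2):
    """Return (luminance, contrast, structure) comparisons for images a, b,
    computed from global statistics; C1/C2 follow K1=0.01, K2=0.03, L=255."""
    C3 = C2 / 2
    mu_a, mu_b = a.mean(), b.mean()
    sa, sb = a.std(), b.std()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    l = (2 * mu_a * mu_b + C1) / (mu_a ** 2 + mu_b ** 2 + C1)  # eq. (2-8)
    c = (2 * sa * sb + C2) / (sa ** 2 + sb ** 2 + C2)          # eq. (2-6)
    s = (cov + C3) / (sa * sb + C3)                            # eq. (2-7)
    return l, c, s
```

For identical images all three components equal 1, so the MS-SSIM product is 1 and the loss $1-\mathrm{MS\_SSIM}$ vanishes, as expected of formula (2-4).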
in order to further improve the image quality, the gap between the frequency domains is introduced as a constraint term, as in formula (2-9):

$$L_{FFL}=\frac{1}{MN}\sum_{u=0}^{M-1}\sum_{v=0}^{N-1}w(u,v)\left|G(u,v)-y(u,v)\right|^{2} \tag{2-9}$$

wherein $M\times N$ represents the size of the image, $w(u,v)$ is the spectrum weight matrix, $(u,v)$ are the frequency-domain coordinates, and $y(u,v)$ and $G(u,v)$ are the corresponding complex frequency values;
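A sketch of formula (2-9) using a 2-D FFT. The published focal frequency loss takes the normalised magnitude of the spectral difference as the weight matrix $w(u,v)$; the sketch below assumes that choice, which the patent text does not spell out:

```python
import numpy as np

def focal_frequency_loss(gen, ref):
    """Weighted squared distance between the 2-D spectra of the generated
    image `gen` and reference image `ref` (both 2-D float arrays)."""
    G = np.fft.fft2(gen)
    Y = np.fft.fft2(ref)
    diff = np.abs(G - Y) ** 2
    # Assumed focal weighting: emphasise the hardest frequencies, i.e. the
    # normalised magnitude of the spectral difference.
    w = np.sqrt(diff)
    w = w / (w.max() + 1e-12)
    return (w * diff).mean()
```

The weighting pushes the generator to close the largest remaining spectral gaps first, which is the stated purpose of the frequency-domain constraint.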
the overall optimization objective of the generator is therefore the following formula (2-10):

$$L_{total}=L_{cGAN}+\alpha L_{L1}+\beta L_{MS\text{-}SSIM}+\gamma L_{FFL} \tag{2-10}$$

wherein α, β and γ represent the weights of the respective loss terms and are set according to the requirements of actual training, a larger weight of a constraint term meaning a larger influence on the total target loss; the objective function is then continuously and iteratively optimized by the gradient descent method to obtain the optimal solution.
6. The automatic photographing control system for a photographing cube according to claim 5, wherein
in the gesture detection module, the gesture detection model comprises an input unit, an encoder and a decoder, wherein the input unit is used for inputting the first N frames of images of the video stream acquired by the camera, and the encoder is used for performing feature extraction and feature fusion on the first N frames of images to acquire a feature map with richer information; the decoder decodes the feature map into a detection result, thereby obtaining a hand region;
optimizing a loss function L of the following formula (1-1) based on a preset gesture detection training data set, and continuously performing iterative training by a gradient descent method to obtain an optimal detection result:

$$
\begin{aligned}
L ={}& \lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(x_{i}-\hat{x}_{i})^{2}+(y_{i}-\hat{y}_{i})^{2}\right] \\
&+\lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[\left(\sqrt{w_{i}}-\sqrt{\hat{w}_{i}}\right)^{2}+\left(\sqrt{h_{i}}-\sqrt{\hat{h}_{i}}\right)^{2}\right] \\
&+\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left(C_{i}-\hat{C}_{i}\right)^{2}+\lambda_{noobj}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}\left(C_{i}-\hat{C}_{i}\right)^{2} \\
&+\sum_{i=0}^{S^{2}}\mathbb{1}_{i}^{obj}\sum_{c\in classes}\left(p_{i}(c)-\hat{p}_{i}(c)\right)^{2}
\end{aligned}
\tag{1-1}
$$

wherein $S^{2}$ represents the size of the output-layer feature map, namely the number of grids into which the input image is divided; B is the number of Anchors per grid; $\mathbb{1}_{ij}^{obj}$ indicates whether the jth Anchor in the ith grid is responsible for the object prediction, the Anchor with the largest IoU with the real frame being selected as the prediction frame; $\mathbb{1}_{ij}^{noobj}$ indicates that the jth Anchor in the ith grid is not responsible for this object prediction; $x_{i}, y_{i}, w_{i}, h_{i}$ and $\hat{x}_{i}, \hat{y}_{i}, \hat{w}_{i}, \hat{h}_{i}$ respectively represent the centre coordinates and the width and height of each real label and of the prediction frame, all between 0 and 1, with $x_{i}, y_{i}, \hat{x}_{i}, \hat{y}_{i}$ being relative cell offsets; $C_{i}$ and $\hat{C}_{i}$ respectively represent the confidence of the real frame and the confidence of the prediction frame; $p_{i}(c)$ and $\hat{p}_{i}(c)$ represent the true object class probability and the predicted class probability of the corresponding cell; classes represents the number of object categories; $\lambda_{coord}$ and $\lambda_{noobj}$ respectively represent the loss weight of the bounding-box coordinate prediction and the confidence prediction loss weight of bounding boxes without objects;
in the hand tracking module, kalman filtering comprises a prediction process and an update process, wherein the prediction process is to estimate the current state according to the state of the previous moment, and the update process is to correct the prediction information according to an observation value so as to estimate the optimal state; the method specifically comprises the following formulas (1-2) to (1-6):
prediction process:

$$\hat{x}_{t}^{-} = A\hat{x}_{t-1} + Bu_{t} \tag{1-2}$$

$$\Sigma_{t}^{-} = A\Sigma_{t-1}A^{T} + Q \tag{1-3}$$

update process:

$$K_{t} = \Sigma_{t}^{-}H^{T}\left(H\Sigma_{t}^{-}H^{T} + R\right)^{-1} \tag{1-4}$$

$$\hat{x}_{t} = \hat{x}_{t}^{-} + K_{t}\left(z_{t} - H\hat{x}_{t}^{-}\right) \tag{1-5}$$

$$\Sigma_{t} = \left(I - K_{t}H\right)\Sigma_{t}^{-} \tag{1-6}$$

wherein $\hat{x}_{t}^{-}$ represents the prediction estimate derived from the previous state; $\hat{x}_{t}$ represents the optimal estimate obtained by updating $\hat{x}_{t}^{-}$; A represents the state transition matrix; H represents the observation matrix; B represents the input control matrix; $u_{t}$ represents the external control quantity of the system; $z_{t}$ represents the observation at time t; Q and R respectively represent the dynamic noise covariance matrix and the measurement noise covariance matrix; $K_{t}$ represents the Kalman gain at time t; $\Sigma_{t}^{-}$ represents the prediction error covariance matrix; $\Sigma_{t}$ represents the filtering error covariance matrix;
according to the process, initializing the state of the tracker by using a hand region bounding box obtained by detecting the first N frames of images of the video stream acquired by the camera, and then continuously predicting the position of an estimated target in the next frame to realize real-time tracking;
in the gesture recognition module, the picture to be recognized containing the gesture of the object to be recognized and the reference gesture pictures in the gesture template library are first scaled to the same size, then the Euclidean distance is calculated according to the following formula (1-7), and the reference gesture picture with the minimum Euclidean distance is found as the reference gesture picture with the highest similarity:

$$t^{*} = \arg\min_{t \in T}\left\| x - t \right\|_{2} \tag{1-7}$$

wherein $x$ represents the picture to be recognized, $T$ represents the set of reference gesture pictures in the gesture template library, $t$ represents an element of the set $T$, and $t^{*}$ represents the element with the minimum Euclidean distance;
if the Euclidean distance of the found $t^{*}$ is less than the preset threshold value, it is determined that the similarity between $t^{*}$ and the picture to be recognized is greater than the set threshold value, so that the gesture of the object to be recognized is judged to be the photographing-trigger gesture.
7. A photographing cube, characterized in that the photographing cube comprises a control processor and, respectively in communication with the control processor, a video camera and a photographing camera; the video camera is used for acquiring a video stream of an object to be photographed in the photographing cube and transmitting the video stream to the control processor; the control processor is used for controlling the photographing camera to take a photograph according to the received video stream, and for processing the photographed image and then outputting it; wherein the control processor is arranged to be able to execute the automatic photographing control method suitable for a photographing cube according to any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211020415.6A CN115297263B (en) | 2022-08-24 | 2022-08-24 | Automatic photographing control method and system suitable for cube shooting and cube shooting |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115297263A CN115297263A (en) | 2022-11-04 |
CN115297263B true CN115297263B (en) | 2023-04-07 |
Family
ID=83832259
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211020415.6A Active CN115297263B (en) | 2022-08-24 | 2022-08-24 | Automatic photographing control method and system suitable for cube shooting and cube shooting |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115297263B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020164282A1 (en) * | 2019-02-14 | 2020-08-20 | 平安科技(深圳)有限公司 | Yolo-based image target recognition method and apparatus, electronic device, and storage medium |
CN113223059A (en) * | 2021-05-17 | 2021-08-06 | 浙江大学 | Weak and small airspace target detection method based on super-resolution feature enhancement |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102420942A (en) * | 2011-11-28 | 2012-04-18 | 康佳集团股份有限公司 | Photograph device and photograph control method based on same |
CN106454071A (en) * | 2016-09-09 | 2017-02-22 | 捷开通讯(深圳)有限公司 | Terminal and automatic shooting method based on gestures |
CN109815893B (en) * | 2019-01-23 | 2021-03-26 | 中山大学 | Color face image illumination domain normalization method based on cyclic generation countermeasure network |
CN111062312B (en) * | 2019-12-13 | 2023-10-27 | RealMe重庆移动通信有限公司 | Gesture recognition method, gesture control device, medium and terminal equipment |
CN112506342B (en) * | 2020-12-04 | 2022-01-28 | 郑州中业科技股份有限公司 | Man-machine interaction method and system based on dynamic gesture recognition |
CN112837234B (en) * | 2021-01-25 | 2022-07-22 | 重庆师范大学 | Human face image restoration method based on multi-column gating convolution network |
CN113608663B (en) * | 2021-07-12 | 2023-07-25 | 哈尔滨工程大学 | Fingertip tracking method based on deep learning and K-curvature method |
Also Published As
Publication number | Publication date |
---|---|
CN115297263A (en) | 2022-11-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11847826B2 (en) | System and method for providing dominant scene classification by semantic segmentation | |
CN113065558B (en) | Lightweight small target detection method combined with attention mechanism | |
CN107392097B (en) | Three-dimensional human body joint point positioning method of monocular color video | |
CN109815826B (en) | Method and device for generating face attribute model | |
CN108234870B (en) | Image processing method, device, terminal and storage medium | |
CN106780543B (en) | A kind of double frame estimating depths and movement technique based on convolutional neural networks | |
CN110490252B (en) | Indoor people number detection method and system based on deep learning | |
CN110532970B (en) | Age and gender attribute analysis method, system, equipment and medium for 2D images of human faces | |
WO2019210555A1 (en) | People counting method and device based on deep neural network and storage medium | |
CN113286194A (en) | Video processing method and device, electronic equipment and readable storage medium | |
US20110299774A1 (en) | Method and system for detecting and tracking hands in an image | |
WO2022073282A1 (en) | Motion recognition method based on feature interactive learning, and terminal device | |
WO2020171379A1 (en) | Capturing a photo using a mobile device | |
CN111147751B (en) | Photographing mode generation method and device and computer readable storage medium | |
CN114360067A (en) | Dynamic gesture recognition method based on deep learning | |
CN111723707A (en) | Method and device for estimating fixation point based on visual saliency | |
CN112434608A (en) | Human behavior identification method and system based on double-current combined network | |
Küchhold et al. | Scale-adaptive real-time crowd detection and counting for drone images | |
WO2023142912A1 (en) | Method and apparatus for detecting left behind object, and storage medium | |
CN109063549A (en) | High-resolution based on deep neural network is taken photo by plane video moving object detection method | |
CN114708615A (en) | Human body detection method based on image enhancement in low-illumination environment, electronic equipment and storage medium | |
CN113065506A (en) | Human body posture recognition method and system | |
CN115297263B (en) | Automatic photographing control method and system suitable for cube shooting and cube shooting | |
CN115035596B (en) | Behavior detection method and device, electronic equipment and storage medium | |
CN108765384B (en) | Significance detection method for joint manifold sequencing and improved convex hull |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||