CN115297263B - Automatic photographing control method and system for a shooting cube, and shooting cube - Google Patents

Automatic photographing control method and system for a shooting cube, and shooting cube

Info

Publication number
CN115297263B
CN115297263B
Authority
CN
China
Prior art keywords
gesture
image
prediction
recognized
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211020415.6A
Other languages
Chinese (zh)
Other versions
CN115297263A
Inventor
朱锦钊
于鹏
刘帅
林铠骏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Fangtu Technology Co ltd
Original Assignee
Guangzhou Fangtu Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Fangtu Technology Co ltd filed Critical Guangzhou Fangtu Technology Co ltd
Priority to CN202211020415.6A
Publication of CN115297263A
Application granted
Publication of CN115297263B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06T5/77
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107Static hand or arm
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/28Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • GPHYSICS
    • G07CHECKING-DEVICES
    • G07FCOIN-FREED OR LIKE APPARATUS
    • G07F17/00Coin-freed apparatus for hiring articles; Coin-freed facilities or services
    • G07F17/26Coin-freed apparatus for hiring articles; Coin-freed facilities or services for printing, stamping, franking, typing or teleprinting apparatus
    • G07F17/266Coin-freed apparatus for hiring articles; Coin-freed facilities or services for printing, stamping, franking, typing or teleprinting apparatus for the use of a photocopier or printing device
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person
    • G06T2207/30201Face

Abstract

The invention discloses an automatic photographing control method for a shooting cube, which comprises the following steps: S1, performing gesture detection on the first N frames of images of a video stream acquired by a camera using a YOLO-based gesture detection model to obtain a hand region, where N ≥ 3; S2, modeling the target motion state of the hand region with Kalman filtering, and tracking and predicting the position of the hand region in the next frame in real time to obtain the gesture of the object to be recognized; S3, comparing the gesture of the object to be recognized with reference gestures in a preset gesture template library to calculate a similarity; when the calculated similarity is greater than a set threshold, judging the gesture of the object to be recognized to be the photographing trigger gesture and starting the photographing countdown; otherwise, returning to step S1; S4, when the photographing countdown ends, controlling the camera to take a photograph to obtain an original image; and S5, performing face beautification on the original image with the constructed generative adversarial network to obtain and output a processed target image.

Description

Automatic photographing control method and system for a shooting cube, and shooting cube
Technical Field
The invention relates to the technical field of self-service photographing, and in particular to an automatic photographing control method and system for a shooting cube, and to the shooting cube itself.
Background
The shooting cube, also called an intelligent photo box or intelligent photo booth, is a self-service photographing device placed in busy public places such as streets and subway stations. It lets users take high-definition photos or ID photos simply and conveniently, is very popular, and has spread to all kinds of public venues such as shopping malls, colleges and universities, tourist attractions, railway stations and airports.
In the course of implementing the invention, the inventors found that the existing shooting cube (self-service photographing device) still leaves room for improvement in both its photographing control mode and the quality of the captured images. The invention therefore aims to simplify the self-service photographing control of the shooting cube and to improve the quality of the captured images.
Disclosure of Invention
The invention aims to provide an automatic photographing control method and system for a shooting cube, and a shooting cube, which can effectively solve the above technical problems in the prior art.
In order to achieve the above object, an embodiment of the present invention provides an automatic photographing control method for a shooting cube, wherein the shooting cube comprises a control processor, and a video camera and a still camera each in communication with the control processor, and the automatic photographing control method comprises:
S1, performing gesture detection on the first N frames of images of the video stream acquired by the video camera using a YOLO-based gesture detection model to obtain a hand region, where N ≥ 3;
S2, modeling the target motion state of the hand region with Kalman filtering, and tracking and predicting the position of the hand region in the next frame in real time to obtain the gesture of the object to be recognized;
S3, comparing the gesture of the object to be recognized with reference gestures in a preset gesture template library to calculate a similarity; when the calculated similarity is greater than a set threshold, judging the gesture of the object to be recognized to be the photographing trigger gesture and starting the photographing countdown; otherwise, returning to step S1;
S4, when the photographing countdown ends, controlling the still camera to take a photograph to obtain an original image;
and S5, performing face beautification on the original image with the constructed generative adversarial network to obtain and output a processed target image.
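For clarity, the following is a minimal sketch of how steps S1–S5 fit together as a control loop. It is only an illustration: the function names, thresholds and countdown value are assumptions, and the actual detector, tracker, matcher and beautification network are the components described in the embodiments below.

```python
# Illustrative sketch of the S1–S5 control loop; helper callables are assumed interfaces.
import time
from typing import Callable, Sequence

def auto_photo_loop(frames: Sequence,           # video stream from the video camera (S1)
                    detect_hand: Callable,      # YOLO-based gesture detector (S1)
                    track_hand: Callable,       # Kalman-filter tracker (S2)
                    match_template: Callable,   # similarity against the template library (S3)
                    capture_photo: Callable,    # still-camera trigger (S4)
                    beautify: Callable,         # GAN-based face beautification (S5)
                    n_init: int = 3,
                    sim_threshold: float = 0.8,
                    countdown_s: int = 3):
    boxes = [detect_hand(f) for f in frames[:n_init]]      # S1: detect on the first N >= 3 frames
    for frame in frames[n_init:]:
        gesture_roi = track_hand(frame, boxes)             # S2: track/predict the hand region
        similarity = match_template(gesture_roi)           # S3: compare with reference gestures
        if similarity > sim_threshold:                     # trigger gesture recognized
            time.sleep(countdown_s)                        # photographing countdown
            raw = capture_photo()                          # S4: take the photo
            return beautify(raw)                           # S5: GAN-based beautification
    return None
```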
Preferably, in step S1, the gesture detection model includes an input unit, an encoder and a decoder, the input unit is configured to input the first N frames of images of the video stream acquired by the camera, and the encoder is configured to perform feature extraction and feature fusion on the first N frames of images to obtain a feature map with richer information; the decoder decodes the feature map into a detection result, thereby obtaining a hand region;
optimizing a loss function L of the following formula (1-1) based on a preset gesture detection training data set, and continuously performing iterative training by gradient descent to obtain the optimal detection result:

$$
\begin{aligned}
L ={} & \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\Big[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\Big] \\
& +\lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\Big[\big(\sqrt{w_i}-\sqrt{\hat{w}_i}\big)^2+\big(\sqrt{h_i}-\sqrt{\hat{h}_i}\big)^2\Big] \\
& +\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\big(C_i-\hat{C}_i\big)^2
  +\lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}\big(C_i-\hat{C}_i\big)^2 \\
& +\sum_{i=0}^{S^2}\mathbb{1}_{i}^{obj}\sum_{c\in classes}\big(p_i(c)-\hat{p}_i(c)\big)^2
\end{aligned}
\tag{1-1}
$$

wherein S² represents the size of the output-layer feature map, i.e. the number of grid cells into which the input image is divided; B is the number of Anchors per grid cell; 1_{ij}^{obj} indicates whether the j-th Anchor in the i-th grid cell is responsible for the object prediction, the Anchor with the largest IoU with the ground-truth box being selected as the prediction box; 1_{ij}^{noobj} indicates that the j-th Anchor in the i-th grid cell is not responsible for the object prediction; x_i, y_i, w_i, h_i and x̂_i, ŷ_i, ŵ_i, ĥ_i respectively denote the center coordinates and the width and height of each ground-truth label and prediction box, all normalized to between 0 and 1, with x_i, y_i, x̂_i, ŷ_i expressed as offsets relative to their grid cell; C_i and Ĉ_i respectively denote the ground-truth box confidence and the predicted box confidence; p_i(c) and p̂_i(c) denote the true object class probability and the predicted class probability of the corresponding grid cell; classes denotes the number of object categories; λ_coord and λ_noobj respectively denote the loss weight for bounding-box coordinate prediction and the confidence-prediction loss weight for bounding boxes that contain no object.
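A compact PyTorch sketch of a loss of the form (1-1) is given below for illustration. The tensor layout (predictions flattened to S·S·B rows of [x, y, w, h, confidence, class scores]) and the default weights are assumptions rather than values fixed by the embodiment; the responsibility mask is assumed to have been built beforehand from the IoU rule described above.

```python
import torch

def yolo_loss(pred, target, obj_mask, lambda_coord=5.0, lambda_noobj=0.5):
    """Sketch of formula (1-1); pred/target: (batch, S*S*B, 5 + classes), obj_mask: bool (batch, S*S*B)."""
    noobj_mask = ~obj_mask
    # coordinate loss, only for the anchor responsible for each object; sqrt on w, h as in (1-1)
    xy_err = ((pred[..., 0:2] - target[..., 0:2]) ** 2).sum(-1)
    wh_err = ((pred[..., 2:4].clamp(min=0).sqrt() - target[..., 2:4].sqrt()) ** 2).sum(-1)
    coord_loss = lambda_coord * (xy_err + wh_err)[obj_mask].sum()
    # confidence loss, split into anchors with / without an object
    conf_err = (pred[..., 4] - target[..., 4]) ** 2
    conf_loss = conf_err[obj_mask].sum() + lambda_noobj * conf_err[noobj_mask].sum()
    # classification loss (applied per responsible anchor here for simplicity)
    cls_loss = ((pred[..., 5:] - target[..., 5:]) ** 2).sum(-1)[obj_mask].sum()
    return coord_loss + conf_loss + cls_loss
```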
Preferably, in step S2, the Kalman filtering comprises a prediction process and an update process: the prediction estimates the current state from the state at the previous time, and the update corrects the prediction with the observation so as to estimate the optimal state. This is given by the following formulas (1-2) to (1-6):

Prediction process:

$$\hat{x}_t^- = A\hat{x}_{t-1} + Bu_t \tag{1-2}$$

$$\Sigma_t^- = A\Sigma_{t-1}A^{T} + Q \tag{1-3}$$

Update process:

$$K_t = \Sigma_t^- H^{T}\left(H\Sigma_t^- H^{T} + R\right)^{-1} \tag{1-4}$$

$$\hat{x}_t = \hat{x}_t^- + K_t\left(z_t - H\hat{x}_t^-\right) \tag{1-5}$$

$$\Sigma_t = \left(I - K_t H\right)\Sigma_t^- \tag{1-6}$$

wherein x̂_t⁻ represents the prediction estimate derived from the previous state; x̂_t represents the optimal estimate obtained by updating x̂_t⁻; A represents the state transition matrix; H represents the observation matrix; B represents the input control matrix; u_t represents the external control quantity of the system; z_t represents the observation at time t; Q and R respectively represent the dynamic noise covariance matrix and the measurement noise covariance matrix; K_t represents the Kalman gain at time t; Σ_t⁻ represents the prediction error covariance matrix; and Σ_t represents the filtering error covariance matrix.

According to the above process, the state of the tracker is initialized with the hand-region bounding box obtained by detecting the first N frames of images of the video stream acquired by the camera, and the position of the target in the next frame is then continuously predicted, thereby realizing real-time tracking.
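The predict/update cycle (1-2)–(1-6) can be sketched as follows. The constant-velocity state [cx, cy, w, h, vx, vy], the noise settings and the omission of the external control term B·u_t (there is no control input when tracking a hand) are assumptions made only for illustration.

```python
import numpy as np

class KalmanBoxTracker:
    def __init__(self, box, dt: float = 1.0):
        self.x = np.array([*box, 0.0, 0.0], dtype=float)       # state initialized from a detected box
        self.P = np.eye(6)                                      # filtering error covariance Sigma_t
        self.A = np.eye(6); self.A[0, 4] = self.A[1, 5] = dt    # state transition matrix A
        self.H = np.zeros((4, 6)); self.H[:4, :4] = np.eye(4)   # observation matrix H
        self.Q = np.eye(6) * 1e-2                               # dynamic noise covariance Q (assumed)
        self.R = np.eye(4) * 1e-1                               # measurement noise covariance R (assumed)

    def predict(self):                                          # (1-2), (1-3), with B*u_t omitted
        self.x = self.A @ self.x
        self.P = self.A @ self.P @ self.A.T + self.Q
        return self.x[:4]                                       # predicted box for the next frame

    def update(self, z):                                        # (1-4), (1-5), (1-6)
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)                # Kalman gain K_t
        self.x = self.x + K @ (np.asarray(z, dtype=float) - self.H @ self.x)
        self.P = (np.eye(6) - K @ self.H) @ self.P
        return self.x[:4]
```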
Preferably, in step S3, the picture to be recognized containing the gesture of the object to be recognized and the reference gesture pictures in the gesture template library are first scaled to the same size, the Euclidean distance is then calculated according to the following formula (1-7), and the reference gesture picture with the minimum Euclidean distance is taken as the reference gesture picture with the highest similarity:

$$t^{*} = \arg\min_{t\in T}\left\lVert x - t\right\rVert_{2} \tag{1-7}$$

wherein x denotes the picture to be recognized, T denotes the set of reference gesture pictures in the gesture template library, t denotes an element of the set T, and t* is the element with the minimum Euclidean distance;

if the Euclidean distance between the found t* and the picture to be recognized is less than a preset threshold, the similarity between the gesture of the object to be recognized and t* is considered greater than the set threshold, and the gesture of the object to be recognized is therefore judged to be the photographing trigger gesture.
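As an illustration of the matching step (1-7), a possible implementation is sketched below; the 64x64 working size and the distance threshold are assumed values, not parameters specified by the embodiment.

```python
import numpy as np
import cv2

def match_gesture(crop, templates, size=(64, 64), dist_threshold=40.0):
    """templates: iterable of (label, image); returns the matched label or None."""
    x = cv2.resize(crop, size).astype(np.float32).ravel() / 255.0
    best_label, best_dist = None, float("inf")
    for label, tmpl in templates:
        t = cv2.resize(tmpl, size).astype(np.float32).ravel() / 255.0
        d = np.linalg.norm(x - t)                 # Euclidean distance ||x - t||_2 as in (1-7)
        if d < best_dist:
            best_label, best_dist = label, d      # keep t* with the minimum distance
    # a small distance corresponds to a high similarity: treat it as the trigger gesture
    return best_label if best_dist < dist_threshold else None
```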
Preferably, in step S5, the generative adversarial network comprises a generator and a discriminator, the generator being implemented as shown in the following formula (2-1),

G(X)=F(X)+X (2-1)

wherein X is the input original image and F(X) is the output obtained after the original image X is processed by the feature learning module F. The feature learning module F extracts features mainly through a sparsely connected parallel structure: the input original image X is first reduced to 1/4 of its original size by convolution and pooling; the resolution is then kept unchanged on the main path, while the branch works at 1/2 of the resolution of the main branch to learn high-level semantic features; finally, the high-level semantic features are mixed with the main branch, which is rich in structural information, and output.

The discriminator D takes X and G(X) as input; after features are extracted through multiple convolution layers, a feature map with a large receptive field and a small resolution is obtained, and the mean value of this feature map is used as the score of the discriminator.
Preferably, throughout model training the discriminator D uses binary cross-entropy as its loss function. In order to better generate the required target image, the generator's loss adds three regularization terms on top of the GAN loss: an L1 loss measuring numerical distance, a perceptual loss measuring multi-scale structural similarity, and a focal frequency loss measuring frequency-domain similarity, as follows:

The target loss function of the conditional GAN can be expressed as shown in formula (2-2),

$$L_{cGAN}(G,D) = \mathbb{E}_{x,y}\big[\log D(x,y)\big] + \mathbb{E}_{x}\big[\log\big(1 - D(x,G(x))\big)\big] \tag{2-2}$$

where G tries to minimize this objective while D tries to maximize it, i.e. $G^{*}=\arg\min_{G}\max_{D}L_{cGAN}(G,D)$; x is the input original image and y is the target image;

in order to ensure consistency between input and output, an L1 loss is introduced as a constraint, as in formula (2-3),

$$L_{L1}(G) = \mathbb{E}_{x,y}\big[\lVert y - G(x)\rVert_{1}\big] \tag{2-3}$$

in order to prevent the generated image from losing structural information, a measure of structural similarity is introduced as in formula (2-4),

$$L_{MS\text{-}SSIM} = 1 - \mathrm{MS\_SSIM}\big(G(x),y\big) \tag{2-4}$$

wherein

$$\mathrm{MS\_SSIM}\big(G(x),y\big) = \big[l_{M}\big(G(x),y\big)\big]^{\alpha_{M}}\prod_{j=1}^{M}\big[c_{j}\big(G(x),y\big)\big]^{\beta_{j}}\big[s_{j}\big(G(x),y\big)\big]^{\gamma_{j}} \tag{2-5}$$

The MS_SSIM method takes the reference image and the original image signal as input and repeatedly applies a low-pass filter and down-samples the image by a factor of 2; assuming the scale of the original image is 1, the highest scale M is obtained after M−1 iterations. At the j-th scale, the contrast comparison and the structural similarity, denoted c_j(G(x),y) and s_j(G(x),y) respectively, are calculated by the following formulas (2-6) and (2-7):

$$c_{j}\big(G(x),y\big) = \frac{2\sigma_{G}\sigma_{y} + C_{2}}{\sigma_{G}^{2} + \sigma_{y}^{2} + C_{2}} \tag{2-6}$$

$$s_{j}\big(G(x),y\big) = \frac{\sigma_{Gy} + C_{3}}{\sigma_{G}\sigma_{y} + C_{3}} \tag{2-7}$$

The luminance comparison is calculated only at scale M, by formula (2-8):

$$l_{M}\big(G(x),y\big) = \frac{2\mu_{G}\mu_{y} + C_{1}}{\mu_{G}^{2} + \mu_{y}^{2} + C_{1}} \tag{2-8}$$

wherein μ_G and μ_y respectively represent the means of the generated image and the label image, σ_G and σ_y the corresponding standard deviations, and σ_Gy their covariance; α_M, β_j, γ_j weight the importance of the luminance, contrast and structural-similarity components at the corresponding scale; to simplify the computation, α_j = β_j = γ_j is taken with $\sum_{j=1}^{M}\gamma_{j}=1$; C1, C2 and C3 are constants introduced to avoid instability, with C_1=(K_1 L)², C_2=(K_2 L)², C_3=C_2/2, the parameters K_1=0.01 and K_2=0.03, and L the dynamic range of the pixel values;

to further improve the image quality, the gap between the frequency domains is introduced as a constraint term, as in formula (2-9):

$$L_{FFL} = \frac{1}{MN}\sum_{u=0}^{M-1}\sum_{v=0}^{N-1} w(u,v)\,\big|y(u,v) - G(u,v)\big|^{2} \tag{2-9}$$

wherein M×N denotes the size of the image, w(u,v) is the spectrum weight matrix, (u,v) are the frequency-domain coordinates, and y(u,v) and G(u,v) are the corresponding complex frequency values;

the overall optimization objective of the generator is therefore the following formula (2-10):

$$L_{total} = L_{cGAN} + \alpha L_{L1} + \beta L_{MS\text{-}SSIM} + \gamma L_{FFL} \tag{2-10}$$

wherein α, β and γ denote the weights of the respective loss terms and are set according to the needs of the actual training; the larger the weight of a constraint term, the larger its influence on the total target loss. The objective function is iteratively optimized by gradient descent to obtain the optimal solution.
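The generator objective (2-10) can be assembled as in the following hedged PyTorch sketch; the MS-SSIM term is assumed to be provided by a separate implementation of (2-5)–(2-8), the spectrum weight w(u,v) is taken as 1, and the weights α, β, γ are placeholder values.

```python
import torch
import torch.nn.functional as F

def focal_frequency_loss(fake, real):                        # (2-9) with w(u, v) = 1 (assumed)
    Ff, Fr = torch.fft.fft2(fake), torch.fft.fft2(real)      # complex frequency values
    return (Ff - Fr).abs().pow(2).mean()

def generator_loss(d_fake_score, fake, real, ms_ssim,
                   alpha=100.0, beta=10.0, gamma=1.0):
    # adversarial term of (2-2): the generator tries to make D label G(x) as real
    l_gan = F.binary_cross_entropy(d_fake_score, torch.ones_like(d_fake_score))
    l_l1 = F.l1_loss(fake, real)                             # (2-3)
    l_msssim = 1.0 - ms_ssim(fake, real)                     # (2-4), ms_ssim supplied externally
    l_ffl = focal_frequency_loss(fake, real)                 # (2-9)
    return l_gan + alpha * l_l1 + beta * l_msssim + gamma * l_ffl   # (2-10)
```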
The embodiment of the invention correspondingly provides an automatic photographing control system for a shooting cube, wherein the shooting cube comprises a control processor, and a video camera and a still camera each in communication with the control processor, and the automatic photographing control system comprises:
the gesture detection module is used for carrying out gesture detection on the first N frames of images of the video stream acquired by the camera based on a gesture detection model of YOLO to obtain a hand area; n is more than or equal to 3;
the hand tracking module is used for modeling the target motion state of the hand area by using Kalman filtering, tracking and predicting the position of the hand area in the next frame in real time and obtaining the gesture of the object to be recognized;
the gesture recognition module is used for comparing the gesture of the object to be recognized with a reference gesture in a preset gesture template library so as to calculate the similarity; when the calculated similarity is larger than a set threshold value, judging the gesture of the object to be recognized as a photographing triggering gesture, and starting photographing countdown; otherwise, returning to the step S1;
the shooting control module is used for controlling the camera to shoot when the shooting countdown is finished so as to obtain an original image;
and the image beautification processing module, which is used for performing face beautification on the original image using the constructed generative adversarial network to obtain and output a processed target image.
Preferably, in the gesture detection module, the gesture detection model includes an input unit, an encoder and a decoder, the input unit is configured to input the first N frames of images of the video stream acquired by the camera, and the encoder is configured to perform feature extraction and feature fusion on the first N frames of images to obtain a feature map with richer information; the decoder decodes the feature map into a detection result, thereby obtaining a hand region;
optimizing a loss function L of the following formula (1-1) based on a preset gesture detection training data set, and continuously performing iterative training by gradient descent to obtain the optimal detection result:

$$
\begin{aligned}
L ={} & \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\Big[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\Big] \\
& +\lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\Big[\big(\sqrt{w_i}-\sqrt{\hat{w}_i}\big)^2+\big(\sqrt{h_i}-\sqrt{\hat{h}_i}\big)^2\Big] \\
& +\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\big(C_i-\hat{C}_i\big)^2
  +\lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}\big(C_i-\hat{C}_i\big)^2 \\
& +\sum_{i=0}^{S^2}\mathbb{1}_{i}^{obj}\sum_{c\in classes}\big(p_i(c)-\hat{p}_i(c)\big)^2
\end{aligned}
\tag{1-1}
$$

wherein S² represents the size of the output-layer feature map, i.e. the number of grid cells into which the input image is divided; B is the number of Anchors per grid cell; 1_{ij}^{obj} indicates whether the j-th Anchor in the i-th grid cell is responsible for the object prediction, the Anchor with the largest IoU with the ground-truth box being selected as the prediction box; 1_{ij}^{noobj} indicates that the j-th Anchor in the i-th grid cell is not responsible for the object prediction; x_i, y_i, w_i, h_i and x̂_i, ŷ_i, ŵ_i, ĥ_i respectively denote the center coordinates and the width and height of each ground-truth label and prediction box, all normalized to between 0 and 1, with x_i, y_i, x̂_i, ŷ_i expressed as offsets relative to their grid cell; C_i and Ĉ_i respectively denote the ground-truth box confidence and the predicted box confidence; p_i(c) and p̂_i(c) denote the true object class probability and the predicted class probability of the corresponding grid cell; classes denotes the number of object categories; λ_coord and λ_noobj respectively denote the loss weight for bounding-box coordinate prediction and the confidence-prediction loss weight for bounding boxes that contain no object.
In the hand tracking module, the Kalman filtering comprises a prediction process and an update process: the prediction estimates the current state from the state at the previous time, and the update corrects the prediction with the observation so as to estimate the optimal state. This is given by the following formulas (1-2) to (1-6):

Prediction process:

$$\hat{x}_t^- = A\hat{x}_{t-1} + Bu_t \tag{1-2}$$

$$\Sigma_t^- = A\Sigma_{t-1}A^{T} + Q \tag{1-3}$$

Update process:

$$K_t = \Sigma_t^- H^{T}\left(H\Sigma_t^- H^{T} + R\right)^{-1} \tag{1-4}$$

$$\hat{x}_t = \hat{x}_t^- + K_t\left(z_t - H\hat{x}_t^-\right) \tag{1-5}$$

$$\Sigma_t = \left(I - K_t H\right)\Sigma_t^- \tag{1-6}$$

wherein x̂_t⁻ represents the prediction estimate derived from the previous state; x̂_t represents the optimal estimate obtained by updating x̂_t⁻; A represents the state transition matrix; H represents the observation matrix; B represents the input control matrix; u_t represents the external control quantity of the system; z_t represents the observation at time t; Q and R respectively represent the dynamic noise covariance matrix and the measurement noise covariance matrix; K_t represents the Kalman gain at time t; Σ_t⁻ represents the prediction error covariance matrix; and Σ_t represents the filtering error covariance matrix.

According to the above process, the state of the tracker is initialized with the hand-region bounding box obtained by detecting the first N frames of images of the video stream acquired by the camera, and the position of the target in the next frame is then continuously predicted, thereby realizing real-time tracking.
In the gesture recognition module, the picture to be recognized containing the gesture of the object to be recognized and the reference gesture pictures in the gesture template library are first scaled to the same size, the Euclidean distance is then calculated according to the following formula (1-7), and the reference gesture picture with the minimum Euclidean distance is taken as the reference gesture picture with the highest similarity:

$$t^{*} = \arg\min_{t\in T}\left\lVert x - t\right\rVert_{2} \tag{1-7}$$

wherein x denotes the picture to be recognized, T denotes the set of reference gesture pictures in the gesture template library, t denotes an element of the set T, and t* is the element with the minimum Euclidean distance;

if the Euclidean distance between the found t* and the picture to be recognized is less than a preset threshold, the similarity between the gesture of the object to be recognized and t* is considered greater than the set threshold, and the gesture of the object to be recognized is therefore judged to be the photographing trigger gesture.
Preferably, in the image beautification processing module, the generative adversarial network comprises a generator and a discriminator, the generator being implemented as shown in the following formula (2-1),

G(X)=F(X)+X (2-1)

wherein X is the input original image and F(X) is the output obtained after the original image X is processed by the feature learning module F. The feature learning module F extracts features mainly through a sparsely connected parallel structure: the input original image X is first reduced to 1/4 of its original size by convolution and pooling; the resolution is then kept unchanged on the main path, while the branch works at 1/2 of the resolution of the main branch to learn high-level semantic features; finally, the high-level semantic features are mixed with the main branch, which is rich in structural information, and output.

The discriminator D takes X and G(X) as input; after features are extracted through multiple convolution layers, a feature map with a large receptive field and a small resolution is obtained, and the mean value of this feature map is used as the score of the discriminator.
In the whole model training process, the discriminator D uses binary cross-entropy as its loss function. In order to better generate the required target image, the generator's loss adds three regularization terms on top of the GAN loss: an L1 loss measuring numerical distance, a perceptual loss measuring multi-scale structural similarity, and a focal frequency loss measuring frequency-domain similarity, as follows:

The target loss function of the conditional GAN can be expressed as shown in formula (2-2),

$$L_{cGAN}(G,D) = \mathbb{E}_{x,y}\big[\log D(x,y)\big] + \mathbb{E}_{x}\big[\log\big(1 - D(x,G(x))\big)\big] \tag{2-2}$$

where G tries to minimize this objective while D tries to maximize it, i.e. $G^{*}=\arg\min_{G}\max_{D}L_{cGAN}(G,D)$; x is the input original image and y is the target image;

in order to ensure consistency between input and output, an L1 loss is introduced as a constraint, as in formula (2-3),

$$L_{L1}(G) = \mathbb{E}_{x,y}\big[\lVert y - G(x)\rVert_{1}\big] \tag{2-3}$$

in order to prevent the generated image from losing structural information, a measure of structural similarity is introduced as in formula (2-4),

$$L_{MS\text{-}SSIM} = 1 - \mathrm{MS\_SSIM}\big(G(x),y\big) \tag{2-4}$$

wherein

$$\mathrm{MS\_SSIM}\big(G(x),y\big) = \big[l_{M}\big(G(x),y\big)\big]^{\alpha_{M}}\prod_{j=1}^{M}\big[c_{j}\big(G(x),y\big)\big]^{\beta_{j}}\big[s_{j}\big(G(x),y\big)\big]^{\gamma_{j}} \tag{2-5}$$

The MS_SSIM method takes the reference image and the original image signal as input and repeatedly applies a low-pass filter and down-samples the image by a factor of 2; assuming the scale of the original image is 1, the highest scale M is obtained after M−1 iterations. At the j-th scale, the contrast comparison and the structural similarity, denoted c_j(G(x),y) and s_j(G(x),y) respectively, are calculated by the following formulas (2-6) and (2-7):

$$c_{j}\big(G(x),y\big) = \frac{2\sigma_{G}\sigma_{y} + C_{2}}{\sigma_{G}^{2} + \sigma_{y}^{2} + C_{2}} \tag{2-6}$$

$$s_{j}\big(G(x),y\big) = \frac{\sigma_{Gy} + C_{3}}{\sigma_{G}\sigma_{y} + C_{3}} \tag{2-7}$$

The luminance comparison is calculated only at scale M, by formula (2-8):

$$l_{M}\big(G(x),y\big) = \frac{2\mu_{G}\mu_{y} + C_{1}}{\mu_{G}^{2} + \mu_{y}^{2} + C_{1}} \tag{2-8}$$

wherein μ_G and μ_y respectively represent the means of the generated image and the label image, σ_G and σ_y the corresponding standard deviations, and σ_Gy their covariance; α_M, β_j, γ_j weight the importance of the luminance, contrast and structural-similarity components at the corresponding scale; to simplify the computation, α_j = β_j = γ_j is taken with $\sum_{j=1}^{M}\gamma_{j}=1$; C1, C2 and C3 are constants introduced to avoid instability, with C_1=(K_1 L)², C_2=(K_2 L)², C_3=C_2/2, the parameters K_1=0.01 and K_2=0.03, and L the dynamic range of the pixel values;

to further improve the image quality, the gap between the frequency domains is introduced as a constraint term, as in formula (2-9):

$$L_{FFL} = \frac{1}{MN}\sum_{u=0}^{M-1}\sum_{v=0}^{N-1} w(u,v)\,\big|y(u,v) - G(u,v)\big|^{2} \tag{2-9}$$

wherein M×N denotes the size of the image, w(u,v) is the spectrum weight matrix, (u,v) are the frequency-domain coordinates, and y(u,v) and G(u,v) are the corresponding complex frequency values;

the overall optimization objective of the generator is therefore the following formula (2-10):

$$L_{total} = L_{cGAN} + \alpha L_{L1} + \beta L_{MS\text{-}SSIM} + \gamma L_{FFL} \tag{2-10}$$

wherein α, β and γ denote the weights of the respective loss terms and are set according to the needs of the actual training; the larger the weight of a constraint term, the larger its influence on the total target loss. The objective function is iteratively optimized by gradient descent to obtain the optimal solution.
The embodiment of the invention also provides a shooting cube, which comprises a control processor, and a video camera and a still camera each in communication with the control processor, wherein the video camera is used to acquire a video stream of the object to be photographed inside the shooting cube and transmit the video stream to the control processor; the control processor is used to control the still camera to take a photograph according to the received video stream and to process and output the captured image; wherein the control processor is configured to execute the automatic photographing control method for a shooting cube described in any of the above embodiments.
Compared with the prior art, the automatic photographing control method and system for a shooting cube, and the shooting cube, provided by the embodiments of the invention effectively simplify self-service photographing with the shooting cube: the person being photographed can trigger the camera to take the picture with a gesture, without touching any physical button, and the images captured by the camera are processed to beautify the facial skin of the portrait, so that the skin looks more textured, healthy and attractive, the image quality of the captured photos is improved, and the user experience is effectively enhanced.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings required to be used in the embodiments will be briefly described below, and obviously, the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a schematic flow chart of an automatic photographing control method for a shooting cube according to an embodiment of the present invention.

Fig. 2 is a schematic structural diagram of the gesture detection model involved in an automatic photographing control method for a shooting cube according to an embodiment of the present invention.

Fig. 3 is a schematic structural diagram of the generator of the generative adversarial network involved in an automatic photographing control method for a shooting cube according to an embodiment of the present invention.

Fig. 4 is a schematic structural diagram of the feature learning module of the generative adversarial network involved in an automatic photographing control method for a shooting cube according to an embodiment of the present invention.

Fig. 5 is a schematic structural diagram of the discriminator of the generative adversarial network involved in an automatic photographing control method for a shooting cube according to an embodiment of the present invention.

Fig. 6 is a schematic structural diagram of an automatic photographing control system for a shooting cube according to an embodiment of the present invention.

Fig. 7 is a schematic structural diagram of a shooting cube according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it is to be understood that the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", and the like, indicate orientations and positional relationships based on those shown in the drawings, and are used only for convenience of description and simplicity of description, and do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be considered as limiting the present invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the present invention, unless otherwise expressly stated or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally formed; may be mechanically, electrically or otherwise in communication with each other; either directly or indirectly through intervening media, either internally or in any other relationship. The specific meanings of the above terms in the present invention can be understood according to specific situations by those of ordinary skill in the art.
Referring to fig. 1, an embodiment of the present invention provides an automatic photographing control method for a shooting cube, where the shooting cube comprises a control processor, and a video camera and a still camera each in communication with the control processor, and the automatic photographing control method comprises:
S1, performing gesture detection on the first N frames of images of the video stream acquired by the video camera using a YOLO-based gesture detection model to obtain a hand region, where N ≥ 3;
S2, modeling the target motion state of the hand region with Kalman filtering, and tracking and predicting the position of the hand region in the next frame in real time to obtain the gesture of the object to be recognized;
S3, comparing the gesture of the object to be recognized with reference gestures in a preset gesture template library to calculate a similarity; when the calculated similarity is greater than a set threshold, judging the gesture of the object to be recognized to be the photographing trigger gesture and starting the photographing countdown; otherwise, returning to step S1;
S4, when the photographing countdown ends, controlling the still camera to take a photograph to obtain an original image;
and S5, performing face beautification on the original image with the constructed generative adversarial network to obtain and output a processed target image.
Specifically, as shown in fig. 2, in step S1, the gesture detection model includes an input unit 21, an encoder 22, and a decoder 23, where the input unit 21 is configured to input a first N frames of images of a video stream acquired by the camera, and the encoder 22 is configured to perform feature extraction and feature fusion on the first N frames of images to obtain a feature map with richer information; the decoder 23 decodes the feature map into a detection result, thereby obtaining a hand region;
the feature extraction backbone network comprises 2x, 4x and 6x Mobilenetv2 blocks in one step, feature fusion adopts FPN to perform feature map fusion, and a decoding part YOLO regression classification.
Specifically, a loss function L of the following formula (1-1) is optimized based on a preset gesture detection training data set, and the optimal detection result is obtained through continuous iterative training by gradient descent:

$$
\begin{aligned}
L ={} & \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\Big[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\Big] \\
& +\lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\Big[\big(\sqrt{w_i}-\sqrt{\hat{w}_i}\big)^2+\big(\sqrt{h_i}-\sqrt{\hat{h}_i}\big)^2\Big] \\
& +\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\big(C_i-\hat{C}_i\big)^2
  +\lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}\big(C_i-\hat{C}_i\big)^2 \\
& +\sum_{i=0}^{S^2}\mathbb{1}_{i}^{obj}\sum_{c\in classes}\big(p_i(c)-\hat{p}_i(c)\big)^2
\end{aligned}
\tag{1-1}
$$

wherein S² represents the size of the output-layer feature map, i.e. the number of grid cells into which the input image is divided; B is the number of Anchors per grid cell; 1_{ij}^{obj} indicates whether the j-th Anchor in the i-th grid cell is responsible for the object prediction, the Anchor with the largest IoU with the ground-truth box being selected as the prediction box; 1_{ij}^{noobj} indicates that the j-th Anchor in the i-th grid cell is not responsible for the object prediction; x_i, y_i, w_i, h_i and x̂_i, ŷ_i, ŵ_i, ĥ_i respectively denote the center coordinates and the width and height of each ground-truth label and prediction box, all normalized to between 0 and 1, with x_i, y_i, x̂_i, ŷ_i expressed as offsets relative to their grid cell; C_i and Ĉ_i respectively denote the ground-truth box confidence and the predicted box confidence; p_i(c) and p̂_i(c) denote the true object class probability and the predicted class probability of the corresponding grid cell; classes denotes the number of object categories; λ_coord and λ_noobj respectively denote the loss weight for bounding-box coordinate prediction and the confidence-prediction loss weight for bounding boxes that contain no object.
It can be understood that the preset gesture detection training data comprises about 20,000 images, with the categories "palm", "fist", "thumb", "one-hand heart", "scissor hand" and "other". In the gesture detection stage the specific category of the gesture is not of concern; mainly the localization function is used. The objective function is optimized on this data set, and the optimal solution is obtained through continuous iterative training by gradient descent.
Further, in step S2, the Kalman filtering comprises a prediction process and an update process: the prediction estimates the current state from the state at the previous time, and the update corrects the prediction with the observation so as to estimate the optimal state. This is given by the following formulas (1-2) to (1-6):

Prediction process:

$$\hat{x}_t^- = A\hat{x}_{t-1} + Bu_t \tag{1-2}$$

$$\Sigma_t^- = A\Sigma_{t-1}A^{T} + Q \tag{1-3}$$

Update process:

$$K_t = \Sigma_t^- H^{T}\left(H\Sigma_t^- H^{T} + R\right)^{-1} \tag{1-4}$$

$$\hat{x}_t = \hat{x}_t^- + K_t\left(z_t - H\hat{x}_t^-\right) \tag{1-5}$$

$$\Sigma_t = \left(I - K_t H\right)\Sigma_t^- \tag{1-6}$$

wherein x̂_t⁻ represents the prediction estimate derived from the previous state; x̂_t represents the optimal estimate obtained by updating x̂_t⁻; A represents the state transition matrix; H represents the observation matrix; B represents the input control matrix; u_t represents the external control quantity of the system; z_t represents the observation at time t; Q and R respectively represent the dynamic noise covariance matrix and the measurement noise covariance matrix; K_t represents the Kalman gain at time t; Σ_t⁻ represents the prediction error covariance matrix; and Σ_t represents the filtering error covariance matrix.

According to the above process, the state of the tracker is initialized with the hand-region bounding box obtained by detecting the first N frames of images of the video stream acquired by the camera, and the position of the target in the next frame is then continuously predicted, thereby realizing real-time tracking.
Further, in step S3, the picture to be recognized containing the gesture of the object to be recognized and the reference gesture pictures in the gesture template library are first scaled to the same size, the Euclidean distance is then calculated according to the following formula (1-7), and the reference gesture picture with the minimum Euclidean distance is taken as the reference gesture picture with the highest similarity:

$$t^{*} = \arg\min_{t\in T}\left\lVert x - t\right\rVert_{2} \tag{1-7}$$

wherein x denotes the picture to be recognized, T denotes the set of reference gesture pictures in the gesture template library, t denotes an element of the set T, and t* is the element with the minimum Euclidean distance;

if the Euclidean distance between the found t* and the picture to be recognized is less than a preset threshold, the similarity between the gesture of the object to be recognized and t* is considered greater than the set threshold, and the gesture of the object to be recognized is therefore judged to be the photographing trigger gesture.
The process of matching against the gesture templates can be described as computing the distance from a point to every element point in the set; the best match is the element of the set with the minimum distance to that point, denoted t*.

It can be seen that a template library is first built by collecting gesture pictures of several types taken from a number of different angles, the types being "palm", "fist", "thumb", "one-hand heart", "scissor hand" and "other", i.e. 6 categories in total. The picture to be recognized and the template pictures are then scaled to the same size, the Euclidean distance is calculated pixel-wise according to formula (1-7), and the template picture with the highest similarity, i.e. the template with the smallest distance, is found. If this distance is less than the set threshold, the two pictures are considered very close, and the category of t* is therefore taken as the recognition result for the picture to be detected.
Further, referring to fig. 3 to 5, in step S5, the constructed generative adversarial network is used to perform face beautification on the original image to obtain and output a processed target image. The face beautification mainly removes facial blemishes and performs skin smoothing, so that the skin looks more textured, healthy and attractive, improving the image quality of the captured photo.
Specifically, the generative adversarial network comprises a generator and a discriminator, and fig. 4 is a detailed diagram of the function F in fig. 3. Considering that the output image G(X) differs from the input image X only by slight changes in characteristics such as texture and color, the generator adopts a residual structure to guarantee the quality of the generated image: learning the residual F(X) makes it easier for the model to converge.
The generator is implemented as shown in the following formula (2-1),
G(X)=F(X)+X (2-1)
x is an input original image, F (X) is output obtained after the original image X is processed by a feature learning module F, the feature learning module F mainly adopts a sparse connected parallel structure to extract features, the input original image X is reduced to the original 1/4 size through convolution pooling, then the resolution ratio is kept unchanged on a main path, the resolution ratio on a branch is 1/2 of that of a main branch, high-level semantic features are learned, and finally the high-level semantic features are mixed with the main branch with rich structural information and output.
The input of the discriminator D (X) is X, G (X), after the characteristics are extracted through multilayer convolution, a characteristic diagram with a large receptive field and a small resolution ratio is obtained, and the average value of the characteristic diagram is used as the score of the discriminator.
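The residual generator G(X) = F(X) + X and the two-branch feature learning module F described above can be sketched as follows; the channel widths, block counts and up-sampling choices are assumptions made only to illustrate the structure.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout, stride=1):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride, 1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class FeatureLearningF(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        self.down = nn.Sequential(conv_block(3, ch, 2), conv_block(ch, ch, 2))   # reduce to 1/4 size
        self.main = nn.Sequential(*[conv_block(ch, ch) for _ in range(3)])       # main path keeps 1/4
        self.branch = nn.Sequential(conv_block(ch, ch, 2), conv_block(ch, ch))   # 1/2 of main: semantics
        self.mix = conv_block(2 * ch, ch)                                        # mix branch and main
        self.out = nn.Sequential(nn.Upsample(scale_factor=4, mode="bilinear",
                                             align_corners=False),
                                 nn.Conv2d(ch, 3, 3, 1, 1))

    def forward(self, x):
        d = self.down(x)
        m = self.main(d)
        b = nn.functional.interpolate(self.branch(d), size=m.shape[-2:],
                                      mode="bilinear", align_corners=False)
        return self.out(self.mix(torch.cat([m, b], dim=1)))                      # F(X)

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.F = FeatureLearningF()

    def forward(self, x):
        return self.F(x) + x                                                      # G(X) = F(X) + X
```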
In the whole model training process, the discriminator D uses binary cross-entropy as its loss function. In order to better generate the required target image, the generator's loss adds three regularization terms on top of the GAN loss: an L1 loss measuring numerical distance, a multi-scale structural similarity (MS-SSIM) perceptual loss, and a Focal Frequency Loss (FFL) measuring frequency-domain similarity, as follows:

The target loss function of the conditional GAN can be expressed as shown in formula (2-2),

$$L_{cGAN}(G,D) = \mathbb{E}_{x,y}\big[\log D(x,y)\big] + \mathbb{E}_{x}\big[\log\big(1 - D(x,G(x))\big)\big] \tag{2-2}$$

where G tries to minimize this objective while D tries to maximize it, i.e. $G^{*}=\arg\min_{G}\max_{D}L_{cGAN}(G,D)$; x is the input original image and y is the target image;

in order to ensure consistency between input and output, an L1 loss is introduced as a constraint, as in formula (2-3),

$$L_{L1}(G) = \mathbb{E}_{x,y}\big[\lVert y - G(x)\rVert_{1}\big] \tag{2-3}$$

in order to prevent the generated image from losing structural information, a measure of structural similarity is introduced as in formula (2-4),

$$L_{MS\text{-}SSIM} = 1 - \mathrm{MS\_SSIM}\big(G(x),y\big) \tag{2-4}$$

wherein

$$\mathrm{MS\_SSIM}\big(G(x),y\big) = \big[l_{M}\big(G(x),y\big)\big]^{\alpha_{M}}\prod_{j=1}^{M}\big[c_{j}\big(G(x),y\big)\big]^{\beta_{j}}\big[s_{j}\big(G(x),y\big)\big]^{\gamma_{j}} \tag{2-5}$$

The MS_SSIM method takes the reference image and the original image signal as input and repeatedly applies a low-pass filter and down-samples the image by a factor of 2; assuming the scale of the original image is 1, the highest scale M is obtained after M−1 iterations. At the j-th scale, the contrast comparison and the structural similarity, denoted c_j(G(x),y) and s_j(G(x),y) respectively, are calculated by the following formulas (2-6) and (2-7):

$$c_{j}\big(G(x),y\big) = \frac{2\sigma_{G}\sigma_{y} + C_{2}}{\sigma_{G}^{2} + \sigma_{y}^{2} + C_{2}} \tag{2-6}$$

$$s_{j}\big(G(x),y\big) = \frac{\sigma_{Gy} + C_{3}}{\sigma_{G}\sigma_{y} + C_{3}} \tag{2-7}$$

The luminance comparison is calculated only at scale M, by formula (2-8):

$$l_{M}\big(G(x),y\big) = \frac{2\mu_{G}\mu_{y} + C_{1}}{\mu_{G}^{2} + \mu_{y}^{2} + C_{1}} \tag{2-8}$$

wherein μ_G and μ_y respectively represent the means of the generated image and the label image, σ_G and σ_y the corresponding standard deviations, and σ_Gy their covariance; α_M, β_j, γ_j weight the importance of the luminance, contrast and structural-similarity components at the corresponding scale; to simplify the computation, α_j = β_j = γ_j is taken with $\sum_{j=1}^{M}\gamma_{j}=1$; C1, C2 and C3 are constants introduced to avoid instability, with C_1=(K_1 L)², C_2=(K_2 L)², C_3=C_2/2, the parameters K_1=0.01 and K_2=0.03, and L the dynamic range of the pixel values;

to further improve the image quality, the gap between the frequency domains is introduced as a constraint term, as in formula (2-9):

$$L_{FFL} = \frac{1}{MN}\sum_{u=0}^{M-1}\sum_{v=0}^{N-1} w(u,v)\,\big|y(u,v) - G(u,v)\big|^{2} \tag{2-9}$$

wherein M×N denotes the size of the image, w(u,v) is the spectrum weight matrix, (u,v) are the frequency-domain coordinates, and y(u,v) and G(u,v) are the corresponding complex frequency values;

the overall optimization objective of the generator is therefore the following formula (2-10):

$$L_{total} = L_{cGAN} + \alpha L_{L1} + \beta L_{MS\text{-}SSIM} + \gamma L_{FFL} \tag{2-10}$$

wherein α, β and γ denote the weights of the respective loss terms and are set according to the needs of the actual training; the larger the weight of a constraint term, the larger its influence on the total target loss. The objective function is iteratively optimized by gradient descent to obtain the optimal solution.
Referring to fig. 6, an embodiment of the present invention provides an automatic photographing control system for a shooting cube, where the shooting cube comprises a control processor, and a video camera and a still camera each in communication with the control processor, and the automatic photographing control system comprises:
the gesture detection module 61 is configured to perform gesture detection on the first N frames of images of the video stream acquired by the camera based on a gesture detection model of the YOLO to obtain a hand region; n is more than or equal to 3;
the hand tracking module 62 is used for modeling the target motion state of the hand region by using kalman filtering, tracking and predicting the position of the hand region in the next frame in real time, and obtaining the gesture of the object to be recognized;
the gesture recognition module 63 is configured to compare the gesture of the object to be recognized with a reference gesture in a preset gesture template library to calculate a similarity; when the calculated similarity is larger than a set threshold value, judging the gesture of the object to be recognized as a photographing triggering gesture, and starting photographing countdown; otherwise, returning to the step S1;
the photographing control module 64 is configured to control the camera to photograph when the photographing countdown is finished, so as to obtain an original image;
and the image beautification processing module 65, which performs face beautification on the original image using the constructed generative adversarial network to obtain and output a processed target image.
Specifically, in the gesture detection module 61, as shown in fig. 2, the gesture detection model includes an input unit 21, an encoder 22 and a decoder 23, the input unit 21 is configured to input the first N frames of images of the video stream acquired by the camera, and the encoder 22 is configured to perform feature extraction and feature fusion on the first N frames of images to obtain a feature map with richer information; the decoder 23 decodes the feature map into a detection result, thereby obtaining a hand region;
optimizing a loss function L of the following formula (1-1) based on a preset gesture detection training data set, and continuously performing iterative training by gradient descent to obtain the optimal detection result:

$$
\begin{aligned}
L ={} & \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\Big[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\Big] \\
& +\lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\Big[\big(\sqrt{w_i}-\sqrt{\hat{w}_i}\big)^2+\big(\sqrt{h_i}-\sqrt{\hat{h}_i}\big)^2\Big] \\
& +\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\big(C_i-\hat{C}_i\big)^2
  +\lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}\big(C_i-\hat{C}_i\big)^2 \\
& +\sum_{i=0}^{S^2}\mathbb{1}_{i}^{obj}\sum_{c\in classes}\big(p_i(c)-\hat{p}_i(c)\big)^2
\end{aligned}
\tag{1-1}
$$

wherein S² represents the size of the output-layer feature map, i.e. the number of grid cells into which the input image is divided; B is the number of Anchors per grid cell; 1_{ij}^{obj} indicates whether the j-th Anchor in the i-th grid cell is responsible for the object prediction, the Anchor with the largest IoU with the ground-truth box being selected as the prediction box; 1_{ij}^{noobj} indicates that the j-th Anchor in the i-th grid cell is not responsible for the object prediction; x_i, y_i, w_i, h_i and x̂_i, ŷ_i, ŵ_i, ĥ_i respectively denote the center coordinates and the width and height of each ground-truth label and prediction box, all normalized to between 0 and 1, with x_i, y_i, x̂_i, ŷ_i expressed as offsets relative to their grid cell; C_i and Ĉ_i respectively denote the ground-truth box confidence and the predicted box confidence; p_i(c) and p̂_i(c) denote the true object class probability and the predicted class probability of the corresponding grid cell; classes denotes the number of object categories; λ_coord and λ_noobj respectively denote the loss weight for bounding-box coordinate prediction and the confidence-prediction loss weight for bounding boxes that contain no object.
specifically, in the hand tracking module 92, kalman filtering includes a prediction process and an update process, where the prediction process estimates a current state according to a state at a previous time, and the update process corrects prediction information according to an observed value, so as to estimate an optimal state; the method specifically comprises the following formulas (1-2) to (1-6):
and (3) prediction process:
Figure BDA0003813919990000201
Figure BDA0003813919990000202
and (3) updating:
Figure BDA0003813919990000203
Figure BDA0003813919990000204
Figure BDA0003813919990000205
wherein, the first and the second end of the pipe are connected with each other,
Figure BDA0003813919990000206
representing a prediction estimate derived from a previous state; />
Figure BDA0003813919990000207
Representing a pair>
Figure BDA0003813919990000208
An updated optimal estimate; a represents a state transition matrix; h represents an observation matrix; b represents an input control matrix; u. of t Representing a system external control quantity; q and R respectively represent a dynamic noise covariance matrix and a measurement noise covariance matrix; k t Representing the Kalman gain at time t; />
Figure BDA0003813919990000209
Representing a prediction error covariance matrix; sigma t Representing a filtering error covariance matrix;
according to the process, initializing the state of the tracker by using a hand region bounding box obtained by detecting the first N frames of images of the video stream acquired by the camera, and then continuously predicting the position of the estimated target in the next frame to realize real-time tracking;
further, in the gesture recognition module 63, first, the to-be-recognized picture with the gesture of the to-be-recognized object and the reference gesture picture in the gesture template library are scaled to the same size, then, the euclidean distance is calculated according to the following formula (1-7), and the reference gesture picture with the minimum euclidean distance is found out to be the reference gesture picture with the highest similarity:
Figure BDA0003813919990000211
wherein x is a picture to be recognized, T is a reference gesture picture set in a gesture template library, T is an element in the set T, and T is * The element with the minimum Euclidean distance;
if found t * Is less than a preset threshold value, t is determined * And the similarity of the image to be recognized and the gesture of the object to be recognized is greater than a set threshold value, so that the gesture of the object to be recognized is determined as a photographing triggering gesture.
Further, in the image beautification processing module 65, the generative adversarial network comprises a generator and a discriminator, the generator being implemented as shown in the following formula (2-1),

G(X)=F(X)+X (2-1)

wherein X is the input original image and F(X) is the output obtained after the original image X is processed by the feature learning module F. The feature learning module F extracts features mainly through a sparsely connected parallel structure: the input original image X is first reduced to 1/4 of its original size by convolution and pooling; the resolution is then kept unchanged on the main path, while the branch works at 1/2 of the resolution of the main branch to learn high-level semantic features; finally, the high-level semantic features are mixed with the main branch, which is rich in structural information, and output.

The discriminator D takes X and G(X) as input; after features are extracted through multiple convolution layers, a feature map with a large receptive field and a small resolution is obtained, and the mean value of this feature map is used as the score of the discriminator.
in the whole model training process, D (X) adopts binary cross entropy as a loss function; in order to generate a required target image better, the loss target of the generator adds three regularization terms on the basis of GANLOS, calculates the L1 loss of the numerical distance, calculates the perception loss of the multi-level structure similarity and calculates the focal frequency loss of the frequency domain similarity, and specifically comprises the following steps:
the target loss function of the conditional GAN can be expressed as shown in equation (2-2),
L cGAN (G,D)=E x,y [logD(x,y)]+E x [log(1-D(x,G(x)))] (2-2)
where G attempts to minimize this goal and D attempts to maximize, argmin G max D L cGAN (G, D), wherein x is an input original image, and y is a target image;
in order to ensure the consistency of input and output, L1 loss is introduced to carry out constraint such as formula (2-3),
L L1 (G)=E x,y [||y-G(x)|| 1 ] (2-3)
in order to prevent the generated image from losing the structural information, a measure of structural similarity is introduced as in equation (2-4),
L MS-SSIM =1-MS_SSIM(G(x),y) (2-4)
wherein the content of the first and second substances,
Figure BDA0003813919990000221
the MS _ SSIM method takes a reference image and an original image signal as input, and sequentially applies a low-pass filter to carry out down-sampling on the image by 2 times; assuming that the scale of an original image is 1, obtaining the highest scale M through M-1 iterations; on the jth scale, the comparative contrast and structural similarity, denoted c respectively, are calculated by the following equations (2-6) and (2-7) j (G (x), y) and s j (G(x),y):
c_j(G(x), y) = (2σ_G σ_y + C2) / (σ_G² + σ_y² + C2)   (2-6)

s_j(G(x), y) = (σ_{Gy} + C3) / (σ_G σ_y + C3)   (2-7)
The luminance comparison is calculated by the formula (2-8) only at the scale M:
l_M(G(x), y) = (2μ_G μ_y + C1) / (μ_G² + μ_y² + C1)   (2-8)
wherein μ_G and μ_y represent the means of the generated image and the label image respectively, σ_G and σ_y represent the corresponding standard deviations, and σ_{Gy} is their covariance; α_M, β_j and γ_j weight the importance of the luminance, contrast and structural similarity components at the corresponding scale; to simplify the computation, let α_j = β_j = γ_j and normalize the scale weights so that

Σ_{j=1}^{M} γ_j = 1;
C1, C2 and C3 are all constants introduced to avoid numerical instability, with C_1 = (K_1 L)², C_2 = (K_2 L)², C_3 = C_2 / 2, taking the parameters K_1 = 0.01 and K_2 = 0.03, where L is the dynamic range of the pixel values;
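A simplified PyTorch sketch of L_MS-SSIM = 1 − MS_SSIM(G(x), y) following equations (2-5)–(2-8). For brevity the statistics are computed globally per image rather than with a sliding Gaussian window, and equal weights α_j = β_j = γ_j = 1/M are assumed across scales; K1 = 0.01 and K2 = 0.03 follow the text. This is an illustrative sketch, not a drop-in MS-SSIM library.

```python
import torch
import torch.nn.functional as F_

def ms_ssim_loss(g, y, scales=5, L=1.0):
    """Multi-scale structural-similarity loss (2-4); inputs must be at least 2**(scales-1) px."""
    K1, K2 = 0.01, 0.03
    C1, C2 = (K1 * L) ** 2, (K2 * L) ** 2
    C3 = C2 / 2
    weight = 1.0 / scales                        # equal alpha_j = beta_j = gamma_j
    ms = torch.ones(g.shape[0], device=g.device)
    for j in range(scales):
        mu_g = g.mean(dim=[1, 2, 3]); mu_y = y.mean(dim=[1, 2, 3])
        sd_g = g.std(dim=[1, 2, 3]);  sd_y = y.std(dim=[1, 2, 3])
        cov = ((g - mu_g.view(-1, 1, 1, 1)) * (y - mu_y.view(-1, 1, 1, 1))).mean(dim=[1, 2, 3])
        c_j = (2 * sd_g * sd_y + C2) / (sd_g ** 2 + sd_y ** 2 + C2)     # (2-6)
        s_j = (cov + C3) / (sd_g * sd_y + C3)                           # (2-7)
        ms = ms * (c_j * s_j).clamp(min=1e-6) ** weight
        if j == scales - 1:
            # luminance term is applied only at the highest scale M, as in (2-8)
            l_M = (2 * mu_g * mu_y + C1) / (mu_g ** 2 + mu_y ** 2 + C1)
            ms = ms * l_M.clamp(min=1e-6) ** weight
        else:
            # low-pass filter + 2x down-sampling before moving to the next scale
            g = F_.avg_pool2d(g, 2); y = F_.avg_pool2d(y, 2)
    return (1.0 - ms).mean()                                            # (2-4)
```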
to further improve the image quality, the gap in the frequency domain is introduced as a constraint term, as in equation (2-9):
L_FFL = (1 / (M·N)) Σ_{u=0}^{M−1} Σ_{v=0}^{N−1} w(u, v) |G(u, v) − y(u, v)|²   (2-9)
wherein M×N represents the size of the image, w(u, v) is the spectrum weight matrix, (u, v) are frequency-domain coordinates, and y(u, v) and G(u, v) are the corresponding complex frequency values;
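An illustrative PyTorch sketch of the focal frequency constraint (2-9). The choice of spectrum weight matrix w(u, v) here — the normalized magnitude of the frequency gap itself, detached from the gradient — is an assumption in the spirit of focal frequency loss; the text only states that w(u, v) is a spectrum weight matrix.

```python
import torch

def focal_frequency_loss(g, y):
    """Frequency-domain gap of equation (2-9), averaged over the M x N spectrum."""
    G_f = torch.fft.fft2(g, norm='ortho')        # G(u, v)
    Y_f = torch.fft.fft2(y, norm='ortho')        # y(u, v)
    gap = (G_f - Y_f).abs() ** 2                 # |G(u, v) - y(u, v)|^2
    w = gap.sqrt()
    w = (w / (w.amax(dim=(-2, -1), keepdim=True) + 1e-8)).detach()   # assumed w(u, v)
    return (w * gap).mean()
```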
the overall optimization objective of the generator is therefore the following equation (2-10):
L_total = L_cGAN + α·L_L1 + β·L_MS-SSIM + γ·L_FFL   (2-10)
wherein α, β and γ represent the weights of the respective loss terms and are set according to the requirements of actual training; the larger the weight of a constraint term, the larger its influence on the total target loss; the objective function is then continuously and iteratively optimized by gradient descent to obtain the optimal solution.
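A hedged sketch of one gradient-descent step on the total generator objective (2-10), reusing the Generator/Discriminator, ms_ssim_loss and focal_frequency_loss sketches above. The weights alpha, beta, gamma and the use of a non-saturating BCE form for the generator's adversarial term are placeholder choices to be set according to the actual training, not values prescribed by the patent.

```python
import torch
import torch.nn.functional as F_

def generator_step(G, D, opt_G, x, y, alpha=100.0, beta=1.0, gamma=1.0):
    """One optimization step of the generator under L_total in (2-10)."""
    g = G(x)
    d_fake = D(x, g)
    # non-saturating generator part of the conditional-GAN objective (2-2)
    l_cgan = F_.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    l_l1 = F_.l1_loss(g, y)                                   # (2-3)
    l_total = (l_cgan + alpha * l_l1                          # (2-10)
               + beta * ms_ssim_loss(g, y)
               + gamma * focal_frequency_loss(g, y))
    opt_G.zero_grad()
    l_total.backward()
    opt_G.step()
    return l_total.item()
```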
Referring to fig. 7, an embodiment of the present invention provides a shooting cube. The shooting cube includes a control processor 1, and a video camera 2 and a photographing camera 3 that are respectively in communication with the control processor 1. The video camera 2 is configured to acquire a video stream of the object to be photographed in the shooting cube and transmit the video stream to the control processor 1; the control processor 1 is configured to control the photographing camera 3 to shoot according to the received video stream and to output the captured image after processing; the control processor 1 is arranged to be able to execute the automatic photographing control method suitable for a shooting cube described in any of the above embodiments.
To sum up, the automatic photographing control method and system suitable for a shooting cube, and the shooting cube itself, effectively simplify automatic photographing: the photographer can trigger the camera to complete photographing with a gesture, without touching any physical button, and the image captured by the camera is processed to beautify the facial skin of the portrait, so that the skin looks more textured, healthy and attractive. This improves the image quality of the photos taken and effectively improves the user experience.
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (7)

1. An automatic photographing control method suitable for a shooting cube, wherein the shooting cube comprises a control processor, and a video camera and a photographing camera which are respectively in communication with the control processor, the automatic photographing control method comprising the following steps:
S1, performing gesture detection on the first N frames of images of the video stream acquired by the video camera based on a YOLO gesture detection model to obtain a hand region, where N ≥ 3;
S2, modeling the target motion state of the hand region by using Kalman filtering, and tracking and predicting the position of the hand region in the next frame in real time to obtain the gesture of the object to be recognized;
S3, comparing the gesture of the object to be recognized with the reference gestures in a preset gesture template library to calculate a similarity; when the calculated similarity is greater than a set threshold value, judging the gesture of the object to be recognized as a photographing trigger gesture and starting a photographing countdown; otherwise, returning to step S1;
S4, when the photographing countdown ends, controlling the photographing camera to shoot to obtain an original image;
S5, performing face beautification processing on the original image by using the constructed generative adversarial network to obtain a processed target image, and outputting the processed target image;
in step S5, the generative adversarial network includes a generator and a discriminator, the generator being implemented as shown in the following formula (2-1),
G(X)=F(X)+X (2-1)
X is the input original image, and F(X) is the output obtained after the original image X is processed by the feature learning module F. The feature learning module F mainly adopts a sparsely connected parallel structure to extract features: the input original image X is first reduced to 1/4 of its original size through convolution and pooling; the resolution is then kept unchanged on the main path, while the resolution on the branch is 1/2 of that of the main branch, where high-level semantic features are learned; finally, the high-level semantic features are mixed with the main branch, which is rich in structural information, and output;
the inputs of the discriminator D(X) are X and G(X); after features are extracted through multilayer convolution, a feature map with a large receptive field and a small resolution is obtained, and the average value of this feature map is used as the score of the discriminator;
during the whole model training process, D(X) adopts binary cross entropy as its loss function; in order to better generate the required target image, the loss objective of the generator adds three regularization terms on top of the GAN loss: an L1 loss measuring numerical distance, a perceptual loss measuring multi-scale structural similarity, and a focal frequency loss measuring frequency-domain similarity, specifically as follows:
the target loss function of the conditional GAN can be expressed as shown in equation (2-2),
L_cGAN(G, D) = E_{x,y}[log D(x, y)] + E_x[log(1 − D(x, G(x)))]   (2-2)
where G attempts to minimize this objective while D attempts to maximize it, i.e., arg min_G max_D L_cGAN(G, D); x is the input original image and y is the target image;
in order to ensure the consistency of input and output, an L1 loss is introduced as a constraint, as in formula (2-3),
L_L1(G) = E_{x,y}[ ||y − G(x)||_1 ]   (2-3)
in order to prevent the generated image from losing the structural information, a measure of structural similarity is introduced as in equation (2-4),
L_MS-SSIM = 1 − MS_SSIM(G(x), y)   (2-4)
wherein

MS_SSIM(G(x), y) = [l_M(G(x), y)]^{α_M} · Π_{j=1}^{M} [c_j(G(x), y)]^{β_j} · [s_j(G(x), y)]^{γ_j}   (2-5)
the MS_SSIM method takes the reference image and the original image signal as input and iteratively applies a low-pass filter followed by 2× down-sampling; assuming the scale of the original image is 1, the highest scale M is obtained after M−1 iterations; at the j-th scale, the contrast comparison and the structural similarity, denoted c_j(G(x), y) and s_j(G(x), y) respectively, are calculated by the following equations (2-6) and (2-7):
c_j(G(x), y) = (2σ_G σ_y + C2) / (σ_G² + σ_y² + C2)   (2-6)

s_j(G(x), y) = (σ_{Gy} + C3) / (σ_G σ_y + C3)   (2-7)
The luminance comparison is calculated by the formula (2-8) only at the scale M:
l_M(G(x), y) = (2μ_G μ_y + C1) / (μ_G² + μ_y² + C1)   (2-8)
wherein μ_G and μ_y represent the means of the generated image and the label image respectively, σ_G and σ_y represent the corresponding standard deviations, and σ_{Gy} is their covariance; α_M, β_j and γ_j weight the importance of the luminance, contrast and structural similarity components at the corresponding scale; to simplify the computation, let α_j = β_j = γ_j and normalize the scale weights so that

Σ_{j=1}^{M} γ_j = 1;
C1, C2 and C3 are all constants introduced to avoid numerical instability, with C_1 = (K_1 L)², C_2 = (K_2 L)², C_3 = C_2 / 2, taking the parameters K_1 = 0.01 and K_2 = 0.03, where L is the dynamic range of the pixel values;
to further improve the image quality, the gap in the frequency domain is introduced as a constraint term, as in equation (2-9):
L_FFL = (1 / (M·N)) Σ_{u=0}^{M−1} Σ_{v=0}^{N−1} w(u, v) |G(u, v) − y(u, v)|²   (2-9)
wherein M×N represents the size of the image, w(u, v) is the spectrum weight matrix, (u, v) are frequency-domain coordinates, and y(u, v) and G(u, v) are the corresponding complex frequency values;
the overall optimization objective of the generator is therefore the following equation (2-10):
L_total = L_cGAN + α·L_L1 + β·L_MS-SSIM + γ·L_FFL   (2-10)
wherein α, β and γ represent the weights of the respective loss terms and are set according to the requirements of actual training; the larger the weight of a constraint term, the larger its influence on the total target loss; the objective function is then continuously and iteratively optimized by gradient descent to obtain the optimal solution.
2. The automatic photographing control method suitable for a shooting cube according to claim 1, wherein
in step S1, the gesture detection model comprises an input unit, an encoder and a decoder; the input unit is configured to input the first N frames of images of the video stream acquired by the video camera, the encoder is configured to perform feature extraction and feature fusion on the first N frames of images to obtain a feature map with richer information, and the decoder decodes the feature map into a detection result, thereby obtaining the hand region;
optimizing a loss function L of the following formula (1-1) based on a preset gesture detection training data set, and continuously performing iterative training by a gradient descent method to obtain an optimal detection result:
L = λ_coord Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} [ (x_i − x̂_i)² + (y_i − ŷ_i)² ]
  + λ_coord Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} [ (√w_i − √ŵ_i)² + (√h_i − √ĥ_i)² ]
  + Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} (C_i − Ĉ_i)²
  + λ_noobj Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{noobj} (C_i − Ĉ_i)²
  + Σ_{i=0}^{S²} 1_{i}^{obj} Σ_{c ∈ classes} (p_i(c) − p̂_i(c))²   (1-1)

wherein S² represents the size of the output-layer feature map, i.e., the number of grid cells into which the input image is divided; B is the number of Anchors per grid cell; 1_{ij}^{obj} indicates whether the j-th Anchor in the i-th grid cell is responsible for this object prediction, the Anchor with the largest IoU with the ground-truth box being selected as the prediction box; 1_{ij}^{noobj} indicates that the j-th Anchor in the i-th grid cell is not responsible for this object prediction; x_i, y_i, w_i, h_i and x̂_i, ŷ_i, ŵ_i, ĥ_i respectively represent the center coordinates and the width and height of the ground-truth label and of the prediction box, all normalized to between 0 and 1, where x_i, y_i and x̂_i, ŷ_i are offsets relative to the grid cell; C_i and Ĉ_i respectively represent the confidence of the ground-truth box and the confidence of the prediction box; p_i(c) and p̂_i(c) represent the true object class probability and the predicted class probability of the corresponding grid cell; classes represents the number of object categories; λ_coord and λ_noobj respectively represent the loss weight of the bounding-box coordinate prediction and the confidence prediction loss weight of bounding boxes containing no object.
3. The automatic photographing control method suitable for a shooting cube according to claim 1, wherein
in step S2, the Kalman filtering comprises a prediction process and an update process: the prediction process estimates the current state from the state at the previous time, and the update process corrects the prediction information according to the observation value so as to estimate the optimal state; specifically, as in the following formulas (1-2) to (1-6):
Prediction process:

x̂_t^− = A x̂_{t−1} + B u_t   (1-2)
Σ_t^− = A Σ_{t−1} A^T + Q   (1-3)

Update process:

K_t = Σ_t^− H^T (H Σ_t^− H^T + R)^{−1}   (1-4)
x̂_t = x̂_t^− + K_t (z_t − H x̂_t^−)   (1-5)
Σ_t = (I − K_t H) Σ_t^−   (1-6)

wherein x̂_t^− represents the prediction estimate derived from the previous state; x̂_t represents the optimal estimate obtained after updating x̂_t^−; A represents the state transition matrix; H represents the observation matrix; B represents the input control matrix; u_t represents the external control quantity of the system; z_t is the observation at time t; Q and R respectively represent the dynamic noise covariance matrix and the measurement noise covariance matrix; K_t represents the Kalman gain at time t; Σ_t^− represents the prediction error covariance matrix; Σ_t represents the filtering error covariance matrix;
according to the process, the state of the tracker is initialized by utilizing a hand region bounding box obtained by detecting the first N frames of images of the video stream acquired by the camera, and then the position of the estimated target in the next frame is continuously predicted, so that real-time tracking is realized.
4. The automatic photographing control method suitable for a shooting cube according to claim 1, wherein
in step S3, the picture to be recognized containing the gesture of the object to be recognized and each reference gesture picture in the gesture template library are first scaled to the same size; the Euclidean distance is then calculated according to the following formula (1-7), and the reference gesture picture with the minimum Euclidean distance is taken as the reference gesture picture with the highest similarity:
t* = argmin_{t ∈ T} √( Σ_i (x_i − t_i)² )   (1-7)

wherein x represents the picture to be recognized, T represents the set of reference gesture pictures in the gesture template library, t represents an element of the set T, and t* is the element with the minimum Euclidean distance;

if the Euclidean distance of the found t* is less than the preset threshold value, the similarity between t* and the picture to be recognized containing the gesture of the object to be recognized is judged to be greater than the set threshold value, so the gesture of the object to be recognized is judged as the photographing trigger gesture.
5. An automatic photographing control system suitable for a shooting cube, wherein the shooting cube comprises a control processor, and a video camera and a photographing camera which are respectively in communication with the control processor, characterized in that the automatic photographing control system comprises:
a gesture detection module, configured to perform gesture detection on the first N frames of images of the video stream acquired by the video camera based on a YOLO gesture detection model to obtain a hand region, where N ≥ 3;
a hand tracking module, configured to model the target motion state of the hand region by using Kalman filtering, and to track and predict the position of the hand region in the next frame in real time to obtain the gesture of the object to be recognized;
a gesture recognition module, configured to compare the gesture of the object to be recognized with the reference gestures in a preset gesture template library to calculate a similarity; when the calculated similarity is greater than a set threshold value, the gesture of the object to be recognized is judged as a photographing trigger gesture and a photographing countdown is started; otherwise, the process returns to the gesture detection module;
a photographing control module, configured to control the photographing camera to shoot when the photographing countdown ends, so as to obtain an original image;
an image beautification processing module, configured to perform face beautification processing on the original image by using the constructed generative adversarial network to obtain a processed target image and to output the processed target image;
in the image beautification processing module, the generative adversarial network comprises a generator and a discriminator, the generator being implemented as in the following formula (2-1),
G(X)=F(X)+X (2-1)
X is the input original image, and F(X) is the output obtained after the original image X is processed by the feature learning module F. The feature learning module F mainly adopts a sparsely connected parallel structure to extract features: the input original image X is first reduced to 1/4 of its original size through convolution and pooling; the resolution is then kept unchanged on the main path, while the resolution on the branch is 1/2 of that of the main branch, where high-level semantic features are learned; finally, the high-level semantic features are mixed with the main branch, which is rich in structural information, and output;
the inputs of the discriminator D(X) are X and G(X); after features are extracted through multilayer convolution, a feature map with a large receptive field and a small resolution is obtained, and the average value of this feature map is used as the score of the discriminator;
during the whole model training process, D(X) adopts binary cross entropy as its loss function; in order to better generate the required target image, the loss objective of the generator adds three regularization terms on top of the GAN loss: an L1 loss measuring numerical distance, a perceptual loss measuring multi-scale structural similarity, and a focal frequency loss measuring frequency-domain similarity, specifically as follows:
the target loss function of the conditional GAN can be expressed as shown in equation (2-2),
L_cGAN(G, D) = E_{x,y}[log D(x, y)] + E_x[log(1 − D(x, G(x)))]   (2-2)
where G attempts to minimize this objective while D attempts to maximize it, i.e., arg min_G max_D L_cGAN(G, D); x is the input original image and y is the target image;
in order to ensure the consistency of input and output, an L1 loss is introduced as a constraint, as in formula (2-3),
L_L1(G) = E_{x,y}[ ||y − G(x)||_1 ]   (2-3)
in order to ensure that the generated image does not lose structural information, a measure of structural similarity is introduced as in equation (2-4),
L_MS-SSIM = 1 − MS_SSIM(G(x), y)   (2-4)
wherein

MS_SSIM(G(x), y) = [l_M(G(x), y)]^{α_M} · Π_{j=1}^{M} [c_j(G(x), y)]^{β_j} · [s_j(G(x), y)]^{γ_j}   (2-5)
the MS_SSIM method takes the reference image and the original image signal as input and iteratively applies a low-pass filter followed by 2× down-sampling; assuming the scale of the original image is 1, the highest scale M is obtained after M−1 iterations; at the j-th scale, the contrast comparison and the structural similarity, denoted c_j(G(x), y) and s_j(G(x), y) respectively, are calculated by the following equations (2-6) and (2-7):
c_j(G(x), y) = (2σ_G σ_y + C2) / (σ_G² + σ_y² + C2)   (2-6)

s_j(G(x), y) = (σ_{Gy} + C3) / (σ_G σ_y + C3)   (2-7)
The luminance comparison is calculated by the formula (2-8) only at the scale M:
l_M(G(x), y) = (2μ_G μ_y + C1) / (μ_G² + μ_y² + C1)   (2-8)
wherein μ_G and μ_y represent the means of the generated image and the label image respectively, σ_G and σ_y represent the corresponding standard deviations, and σ_{Gy} is their covariance; α_M, β_j and γ_j weight the importance of the luminance, contrast and structural similarity components at the corresponding scale; to simplify the computation, let α_j = β_j = γ_j and normalize the scale weights so that

Σ_{j=1}^{M} γ_j = 1;
C1, C2 and C3 are all constants introduced to avoid numerical instability, with C_1 = (K_1 L)², C_2 = (K_2 L)², C_3 = C_2 / 2, taking the parameters K_1 = 0.01 and K_2 = 0.03, where L is the dynamic range of the pixel values;
to further improve the image quality, the gap in the frequency domain is introduced as a constraint term, as in equation (2-9):
L_FFL = (1 / (M·N)) Σ_{u=0}^{M−1} Σ_{v=0}^{N−1} w(u, v) |G(u, v) − y(u, v)|²   (2-9)
wherein M×N represents the size of the image, w(u, v) is the spectrum weight matrix, (u, v) are frequency-domain coordinates, and y(u, v) and G(u, v) are the corresponding complex frequency values;
the overall optimization objective of the generator is therefore the following equation (2-10):
L_total = L_cGAN + α·L_L1 + β·L_MS-SSIM + γ·L_FFL   (2-10)
wherein α, β and γ represent the weights of the respective loss terms and are set according to the requirements of actual training; the larger the weight of a constraint term, the larger its influence on the total target loss; the objective function is then continuously and iteratively optimized by gradient descent to obtain the optimal solution.
6. The automatic photographing control system suitable for a shooting cube according to claim 5, wherein
in the gesture detection module, the gesture detection model comprises an input unit, an encoder and a decoder; the input unit is configured to input the first N frames of images of the video stream acquired by the video camera, the encoder is configured to perform feature extraction and feature fusion on the first N frames of images to obtain a feature map with richer information, and the decoder decodes the feature map into a detection result, thereby obtaining the hand region;
optimizing a loss function L of the following formula (1-1) based on a preset gesture detection training data set, and continuously performing iterative training by a gradient descent method to obtain an optimal detection result:
L = λ_coord Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} [ (x_i − x̂_i)² + (y_i − ŷ_i)² ]
  + λ_coord Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} [ (√w_i − √ŵ_i)² + (√h_i − √ĥ_i)² ]
  + Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{obj} (C_i − Ĉ_i)²
  + λ_noobj Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_{ij}^{noobj} (C_i − Ĉ_i)²
  + Σ_{i=0}^{S²} 1_{i}^{obj} Σ_{c ∈ classes} (p_i(c) − p̂_i(c))²   (1-1)

wherein S² represents the size of the output-layer feature map, i.e., the number of grid cells into which the input image is divided; B is the number of Anchors per grid cell; 1_{ij}^{obj} indicates whether the j-th Anchor in the i-th grid cell is responsible for this object prediction, the Anchor with the largest IoU with the ground-truth box being selected as the prediction box; 1_{ij}^{noobj} indicates that the j-th Anchor in the i-th grid cell is not responsible for this object prediction; x_i, y_i, w_i, h_i and x̂_i, ŷ_i, ŵ_i, ĥ_i respectively represent the center coordinates and the width and height of the ground-truth label and of the prediction box, all normalized to between 0 and 1, where x_i, y_i and x̂_i, ŷ_i are offsets relative to the grid cell; C_i and Ĉ_i respectively represent the confidence of the ground-truth box and the confidence of the prediction box; p_i(c) and p̂_i(c) represent the true object class probability and the predicted class probability of the corresponding grid cell; classes represents the number of object categories; λ_coord and λ_noobj respectively represent the loss weight of the bounding-box coordinate prediction and the confidence prediction loss weight of bounding boxes containing no object;
in the hand tracking module, the Kalman filtering comprises a prediction process and an update process: the prediction process estimates the current state from the state at the previous time, and the update process corrects the prediction information according to the observation value so as to estimate the optimal state; specifically, as in the following formulas (1-2) to (1-6):
Prediction process:

x̂_t^− = A x̂_{t−1} + B u_t   (1-2)
Σ_t^− = A Σ_{t−1} A^T + Q   (1-3)

Update process:

K_t = Σ_t^− H^T (H Σ_t^− H^T + R)^{−1}   (1-4)
x̂_t = x̂_t^− + K_t (z_t − H x̂_t^−)   (1-5)
Σ_t = (I − K_t H) Σ_t^−   (1-6)

wherein x̂_t^− represents the prediction estimate derived from the previous state; x̂_t represents the optimal estimate obtained after updating x̂_t^−; A represents the state transition matrix; H represents the observation matrix; B represents the input control matrix; u_t represents the external control quantity of the system; z_t is the observation at time t; Q and R respectively represent the dynamic noise covariance matrix and the measurement noise covariance matrix; K_t represents the Kalman gain at time t; Σ_t^− represents the prediction error covariance matrix; Σ_t represents the filtering error covariance matrix;
according to the process, initializing the state of the tracker by using a hand region bounding box obtained by detecting the first N frames of images of the video stream acquired by the camera, and then continuously predicting the position of an estimated target in the next frame to realize real-time tracking;
in the gesture recognition module, the picture to be recognized containing the gesture of the object to be recognized and each reference gesture picture in the gesture template library are first scaled to the same size; the Euclidean distance is then calculated according to the following formula (1-7), and the reference gesture picture with the minimum Euclidean distance is taken as the reference gesture picture with the highest similarity:
t* = argmin_{t ∈ T} √( Σ_i (x_i − t_i)² )   (1-7)

wherein x is the picture to be recognized, T is the set of reference gesture pictures in the gesture template library, t is an element of the set T, and t* is the element with the minimum Euclidean distance;

if the Euclidean distance of the found t* is less than the preset threshold value, the similarity between t* and the picture to be recognized containing the gesture of the object to be recognized is judged to be greater than the set threshold value, so the gesture of the object to be recognized is judged as the photographing trigger gesture.
7. A shooting cube, characterized in that the shooting cube comprises a control processor, and a video camera and a photographing camera which are respectively in communication with the control processor; the video camera is used to acquire a video stream of the object to be photographed in the shooting cube and to transmit the video stream to the control processor; the control processor is used to control the photographing camera to shoot according to the received video stream and to output the captured image after processing; wherein the control processor is arranged to be able to execute the automatic photographing control method suitable for a shooting cube according to any one of claims 1 to 4.
CN202211020415.6A 2022-08-24 2022-08-24 Automatic photographing control method and system suitable for cube shooting and cube shooting Active CN115297263B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211020415.6A CN115297263B (en) 2022-08-24 2022-08-24 Automatic photographing control method and system suitable for cube shooting and cube shooting


Publications (2)

Publication Number Publication Date
CN115297263A CN115297263A (en) 2022-11-04
CN115297263B true CN115297263B (en) 2023-04-07

Family

ID=83832259

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211020415.6A Active CN115297263B (en) 2022-08-24 2022-08-24 Automatic photographing control method and system suitable for cube shooting and cube shooting

Country Status (1)

Country Link
CN (1) CN115297263B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020164282A1 (en) * 2019-02-14 2020-08-20 平安科技(深圳)有限公司 Yolo-based image target recognition method and apparatus, electronic device, and storage medium
CN113223059A (en) * 2021-05-17 2021-08-06 浙江大学 Weak and small airspace target detection method based on super-resolution feature enhancement

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102420942A (en) * 2011-11-28 2012-04-18 康佳集团股份有限公司 Photograph device and photograph control method based on same
CN106454071A (en) * 2016-09-09 2017-02-22 捷开通讯(深圳)有限公司 Terminal and automatic shooting method based on gestures
CN109815893B (en) * 2019-01-23 2021-03-26 中山大学 Color face image illumination domain normalization method based on cyclic generation countermeasure network
CN111062312B (en) * 2019-12-13 2023-10-27 RealMe重庆移动通信有限公司 Gesture recognition method, gesture control device, medium and terminal equipment
CN112506342B (en) * 2020-12-04 2022-01-28 郑州中业科技股份有限公司 Man-machine interaction method and system based on dynamic gesture recognition
CN112837234B (en) * 2021-01-25 2022-07-22 重庆师范大学 Human face image restoration method based on multi-column gating convolution network
CN113608663B (en) * 2021-07-12 2023-07-25 哈尔滨工程大学 Fingertip tracking method based on deep learning and K-curvature method


Also Published As

Publication number Publication date
CN115297263A (en) 2022-11-04


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant