CN115297263B - Automatic photographing control method and system suitable for cube shooting and cube shooting - Google Patents
Automatic photographing control method and system suitable for cube shooting and cube shooting Download PDFInfo
- Publication number
- CN115297263B CN115297263B CN202211020415.6A CN202211020415A CN115297263B CN 115297263 B CN115297263 B CN 115297263B CN 202211020415 A CN202211020415 A CN 202211020415A CN 115297263 B CN115297263 B CN 115297263B
- Authority
- CN
- China
- Prior art keywords
- gesture
- image
- prediction
- recognized
- representing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000000034 method Methods 0.000 title claims abstract description 78
- 238000001514 detection method Methods 0.000 claims abstract description 47
- 238000001914 filtration Methods 0.000 claims abstract description 19
- 238000012545 processing Methods 0.000 claims abstract description 16
- 239000011159 matrix material Substances 0.000 claims description 48
- 230000006870 function Effects 0.000 claims description 28
- 238000012549 training Methods 0.000 claims description 26
- 238000010586 diagram Methods 0.000 claims description 19
- 238000011478 gradient descent method Methods 0.000 claims description 13
- 230000004927 fusion Effects 0.000 claims description 8
- 238000000605 extraction Methods 0.000 claims description 7
- 230000000052 comparative effect Effects 0.000 claims description 6
- 238000005259 measurement Methods 0.000 claims description 6
- 238000005457 optimization Methods 0.000 claims description 6
- 238000011176 pooling Methods 0.000 claims description 6
- 238000005070 sampling Methods 0.000 claims description 6
- 238000001228 spectrum Methods 0.000 claims description 6
- 230000007704 transition Effects 0.000 claims description 6
- 230000008447 perception Effects 0.000 claims description 5
- 238000004891 communication Methods 0.000 claims description 4
- 230000005540 biological transmission Effects 0.000 claims 1
- 230000001815 facial effect Effects 0.000 description 2
- 210000003813 thumb Anatomy 0.000 description 2
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G06T5/77—
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/107—Static hand or arm
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/28—Recognition of hand or arm movements, e.g. recognition of deaf sign language
-
- G—PHYSICS
- G07—CHECKING-DEVICES
- G07F—COIN-FREED OR LIKE APPARATUS
- G07F17/00—Coin-freed apparatus for hiring articles; Coin-freed facilities or services
- G07F17/26—Coin-freed apparatus for hiring articles; Coin-freed facilities or services for printing, stamping, franking, typing or teleprinting apparatus
- G07F17/266—Coin-freed apparatus for hiring articles; Coin-freed facilities or services for printing, stamping, franking, typing or teleprinting apparatus for the use of a photocopier or printing device
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30196—Human being; Person
- G06T2207/30201—Face
Abstract
The invention discloses an automatic photographing control method suitable for a photographing cube, which comprises the following steps. S1: perform gesture detection on the first N frames of images of the video stream acquired by the camera using a YOLO-based gesture detection model to obtain a hand region, where N ≥ 3. S2: model the target motion state of the hand region using Kalman filtering, tracking and predicting the position of the hand region in the next frame in real time to obtain the gesture of the object to be recognized. S3: compare the gesture of the object to be recognized with the reference gestures in a preset gesture template library to calculate a similarity; when the calculated similarity is larger than a set threshold, judge the gesture of the object to be recognized to be the photographing trigger gesture and start the photographing countdown; otherwise, return to step S1. S4: when the photographing countdown ends, control the camera to take a photograph, obtaining an original image. S5: perform face-beautification processing on the original image using the constructed generative adversarial network to obtain and output the processed target image.
Description
Technical Field
The invention relates to the technical field of self-service photographing, in particular to an automatic photographing control method and system suitable for a cube and the cube.
Background
The shooting cube, also called an intelligent photo box or intelligent photo booth, is a self-service photographing device placed in busy public places such as streets and subway stations, allowing users to conveniently take high-definition photos, ID photos, and the like. It is very popular and is found in all kinds of public venues such as shopping malls, colleges and universities, tourist attractions, railway stations, and airports.
In the process of implementing the invention, the inventor found that the conventional shooting cube (self-service photographing device) leaves room for further improvement in both its photographing control mode and the quality of the captured images. The invention therefore aims to simplify the self-service photographing control of the shooting cube and to improve the quality of the captured images.
Disclosure of Invention
The invention aims to provide an automatic photographing control method and system suitable for a photographing cube, and the photographing cube itself, which can effectively solve the above technical problems in the prior art.
In order to achieve the above object, an embodiment of the present invention provides an automatic photographing control method suitable for a photographing cube, wherein the photographing cube includes a control processor and a camera in communication with the control processor, and the automatic photographing control method includes:
S1, performing gesture detection on the first N frames of images of a video stream acquired by the camera based on a YOLO gesture detection model to obtain a hand area, where N ≥ 3;
s2, modeling the target motion state of the hand area by using Kalman filtering, and tracking and predicting the position of the hand area in the next frame in real time to obtain a gesture of an object to be recognized;
s3, comparing the gesture of the object to be recognized with a reference gesture in a preset gesture template library to calculate similarity; when the calculated similarity is larger than a set threshold value, judging the gesture of the object to be recognized as a photographing triggering gesture, and starting photographing countdown; otherwise, returning to the step S1;
s4, when the photographing countdown is finished, controlling the camera to photograph to obtain an original image;
and S5, performing face-beautification processing on the original image using the constructed generative adversarial network to obtain a processed target image and outputting it.
Preferably, in step S1, the gesture detection model includes an input unit, an encoder and a decoder, the input unit is configured to input the first N frames of images of the video stream acquired by the camera, and the encoder is configured to perform feature extraction and feature fusion on the first N frames of images to obtain a feature map with richer information; the decoder decodes the feature map into a detection result, thereby obtaining a hand region;
optimizing a loss function L of the following formula (1-1) based on a preset gesture detection training data set, and continuously performing iterative training by a gradient descent method to obtain an optimal detection result:

$$
\begin{aligned}
L ={}& \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \\
&+\lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2+\left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right] \\
&+\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left(C_i-\hat{C}_i\right)^2
+\lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}\left(C_i-\hat{C}_i\right)^2 \\
&+\sum_{i=0}^{S^2}\mathbb{1}_{i}^{obj}\sum_{c\in classes}\left(p_i(c)-\hat{p}_i(c)\right)^2
\end{aligned}
\tag{1-1}
$$

where $S^2$ represents the size of the output-layer feature map, i.e. the number of grid cells into which the input image is divided; $B$ is the number of Anchors per grid cell; $\mathbb{1}_{ij}^{obj}$ indicates that the $j$-th Anchor in the $i$-th cell is responsible for predicting this object (the Anchor with the largest IoU with the ground-truth box is selected as the prediction box), and $\mathbb{1}_{ij}^{noobj}$ indicates that it is not responsible; $x_i, y_i, w_i, h_i$ and $\hat{x}_i, \hat{y}_i, \hat{w}_i, \hat{h}_i$ are the center coordinates and the width and height of each ground-truth label and of the prediction box respectively, all normalized to between 0 and 1, with $x_i, y_i$ and $\hat{x}_i, \hat{y}_i$ given as offsets relative to their cell; $C_i$ and $\hat{C}_i$ are the ground-truth box confidence and predicted box confidence respectively; $p_i(c)$ and $\hat{p}_i(c)$ are the true and predicted class probabilities of the corresponding cell; $classes$ is the number of object categories; $\lambda_{coord}$ and $\lambda_{noobj}$ are the loss weights for bounding-box coordinate prediction and for the confidence prediction of boxes containing no object, respectively.
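As an illustrative sketch (not the patent's code), the structure of this sum-squared YOLO-style loss can be written out in NumPy for the simplified single-Anchor case (B = 1); the array layout and the responsibility mask are assumptions:

```python
import numpy as np

def yolo_loss(pred, target, obj_mask, lambda_coord=5.0, lambda_noobj=0.5):
    """Simplified sum-squared YOLO-style loss over a flattened S*S grid.

    pred, target: (S*S, 5 + classes) arrays laid out as [x, y, w, h, conf, p(c)...]
    obj_mask:     (S*S,) boolean, True where a cell is responsible for an object.
    Assumes one Anchor per cell (B = 1); lambda defaults follow the YOLO paper.
    """
    noobj = ~obj_mask
    # coordinate terms, weighted by lambda_coord (square roots on width/height)
    xy = np.sum((pred[obj_mask, :2] - target[obj_mask, :2]) ** 2)
    wh = np.sum((np.sqrt(pred[obj_mask, 2:4]) - np.sqrt(target[obj_mask, 2:4])) ** 2)
    # confidence terms: full weight on object cells, lambda_noobj elsewhere
    conf_obj = np.sum((pred[obj_mask, 4] - target[obj_mask, 4]) ** 2)
    conf_noobj = np.sum((pred[noobj, 4] - target[noobj, 4]) ** 2)
    # classification term on object cells only
    cls = np.sum((pred[obj_mask, 5:] - target[obj_mask, 5:]) ** 2)
    return lambda_coord * (xy + wh) + conf_obj + lambda_noobj * conf_noobj + cls
```

A perfect prediction yields zero loss; a confidence error on an object cell contributes with weight 1, while the same error on a background cell is down-weighted by lambda_noobj.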
Preferably, in step S2, the Kalman filtering comprises a prediction process and an update process: the prediction estimates the current state from the state at the previous time step, and the update corrects the prediction using the observation, yielding the optimal state estimate. Specifically, it comprises the following formulas (1-2) to (1-6).

Prediction process:

$$\hat{x}_t^- = A\hat{x}_{t-1} + Bu_t \tag{1-2}$$
$$\Sigma_t^- = A\Sigma_{t-1}A^{\mathsf{T}} + Q \tag{1-3}$$

Update process:

$$K_t = \Sigma_t^- H^{\mathsf{T}}\left(H\Sigma_t^- H^{\mathsf{T}} + R\right)^{-1} \tag{1-4}$$
$$\hat{x}_t = \hat{x}_t^- + K_t\left(z_t - H\hat{x}_t^-\right) \tag{1-5}$$
$$\Sigma_t = (I - K_t H)\Sigma_t^- \tag{1-6}$$

where $\hat{x}_t^-$ is the prediction estimate derived from the previous state; $\hat{x}_t$ is the updated optimal estimate of $\hat{x}_t^-$; $A$ is the state transition matrix; $H$ is the observation matrix; $B$ is the input control matrix; $u_t$ is the external control quantity of the system; $z_t$ is the observation at time $t$; $Q$ and $R$ are the dynamic (process) noise covariance matrix and the measurement noise covariance matrix respectively; $K_t$ is the Kalman gain at time $t$; $\Sigma_t^-$ is the prediction error covariance matrix; and $\Sigma_t$ is the filtering error covariance matrix.
According to the above process, the tracker's state is initialized with the hand-region bounding box obtained by detecting the first N frames of images of the video stream acquired by the camera, and the estimated target's position in each subsequent frame is then continuously predicted, realizing real-time tracking.
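Equations (1-2) to (1-6) map directly onto a few lines of NumPy. The sketch below is generic Kalman filtering in the patent's notation (the covariance Σ is written P here), not the tracker's actual implementation:

```python
import numpy as np

def kalman_predict(x, P, A, B, u, Q):
    """Prediction process (eqs. 1-2, 1-3): propagate state and covariance."""
    x_pred = A @ x + B @ u            # eq. (1-2)
    P_pred = A @ P @ A.T + Q          # eq. (1-3)
    return x_pred, P_pred

def kalman_update(x_pred, P_pred, z, H, R):
    """Update process (eqs. 1-4 to 1-6): correct with measurement z."""
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)          # Kalman gain, eq. (1-4)
    x = x_pred + K @ (z - H @ x_pred)            # eq. (1-5)
    P = (np.eye(len(x_pred)) - K @ H) @ P_pred   # eq. (1-6)
    return x, P
```

With equal prediction and measurement uncertainty, the gain is 0.5 and the updated estimate lands halfway between the prediction and the observation, which is the expected blending behaviour.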
Preferably, in step S3, the picture to be recognized containing the gesture of the object to be recognized and the reference gesture pictures in the gesture template library are first scaled to the same size; the Euclidean distance is then computed according to the following formula (1-7), and the reference gesture picture with the minimum Euclidean distance is taken as the reference gesture picture with the highest similarity:

$$t^* = \arg\min_{t \in T}\left\|x - t\right\|_2 \tag{1-7}$$

where $x$ is the picture to be recognized, $T$ is the set of reference gesture pictures in the gesture template library, $t$ is an element of the set $T$, and $t^*$ is the element with the minimum Euclidean distance.

If the Euclidean distance of the found $t^*$ is less than a preset threshold, the similarity between $t^*$ and the picture to be recognized is deemed greater than the set threshold, and the gesture of the object to be recognized is judged to be the photographing trigger gesture.
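A minimal sketch of the template-matching decision in step S3, assuming the images are already rescaled to a common size and represented as NumPy arrays (the function names are illustrative, not the patent's):

```python
import numpy as np

def best_template(x, templates):
    """Eq. (1-7): pick the reference gesture image with the smallest L2 distance."""
    dists = [np.linalg.norm(x.astype(float) - t.astype(float)) for t in templates]
    i = int(np.argmin(dists))
    return i, dists[i]

def is_trigger(x, templates, dist_threshold):
    """Distance below threshold corresponds to similarity above threshold (step S3)."""
    _, d = best_template(x, templates)
    return d < dist_threshold
```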
Preferably, in step S5, the generative adversarial network comprises a generator and a discriminator, the generator being implemented as shown in the following formula (2-1),

G(X)=F(X)+X (2-1)

where X is the input original image and F(X) is the output obtained after the original image X is processed by the feature learning module F. The feature learning module F extracts features mainly through a sparsely connected parallel structure: the input original image X is first reduced to 1/4 of its original size by convolution and pooling; the resolution is then kept unchanged on the main path, while the branch runs at 1/2 the resolution of the main branch to learn high-level semantic features; finally, the high-level semantic features are fused with the structurally rich main branch and output.

The discriminator D takes X and G(X) as input; after feature extraction through multi-layer convolution, a feature map with a large receptive field and small resolution is obtained, and the mean of this feature map is used as the discriminator's score.
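The residual form of equation (2-1) and the discriminator's mean-of-feature-map score reduce to very little code; in the sketch below, `residual_fn` stands in for the feature learning module F, which is assumed rather than implemented:

```python
import numpy as np

def generator_output(x, residual_fn):
    """Eq. (2-1): the network learns only the retouching residual F(X);
    the global skip connection preserves the input's overall structure."""
    return residual_fn(x) + x

def discriminator_score(feature_map):
    """Patch-style score: the mean of the final low-resolution feature map."""
    return float(np.mean(feature_map))
```

When the learned residual is zero, G(X) reduces to the identity, which is why this parameterization makes it easy for the generator to leave most of the image untouched.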
Preferably, D adopts binary cross-entropy as its loss function throughout model training. To better generate the required target image, the generator's loss objective adds three regularization terms on top of the GAN loss: an L1 loss measuring numerical distance, a perceptual loss measuring multi-scale structural similarity, and a focal frequency loss measuring frequency-domain similarity. Specifically:
the target loss function of the conditional GAN can be expressed as shown in equation (2-2),
$$L_{cGAN}(G,D) = \mathbb{E}_{x,y}\left[\log D(x,y)\right] + \mathbb{E}_{x}\left[\log\left(1 - D(x, G(x))\right)\right] \tag{2-2}$$

where G tries to minimize this objective while D tries to maximize it, i.e. $\arg\min_G\max_D L_{cGAN}(G,D)$; $x$ is the input original image and $y$ is the target image;
in order to ensure the consistency of input and output, an L1 loss is introduced as a constraint, as in formula (2-3):

$$L_{L1}(G) = \mathbb{E}_{x,y}\left[\left\|y - G(x)\right\|_1\right] \tag{2-3}$$
in order to prevent the generated image from losing structural information, a measure of structural similarity is introduced, as in formula (2-4):

$$L_{MS\text{-}SSIM} = 1 - \mathrm{MS\_SSIM}(G(x), y) \tag{2-4}$$

where

$$\mathrm{MS\_SSIM}(G(x),y) = \left[l_M(G(x),y)\right]^{\alpha_M}\prod_{j=1}^{M}\left[c_j(G(x),y)\right]^{\beta_j}\left[s_j(G(x),y)\right]^{\gamma_j} \tag{2-5}$$

The MS_SSIM method takes the reference image and the original image signal as input, and repeatedly applies a low-pass filter followed by downsampling by a factor of 2. Taking the scale of the original image as 1, the highest scale M is obtained after M-1 iterations. At the j-th scale, the contrast comparison and structural similarity, denoted $c_j(G(x),y)$ and $s_j(G(x),y)$ respectively, are computed by formulas (2-6) and (2-7):

$$c_j(G(x),y) = \frac{2\sigma_G\sigma_y + C_2}{\sigma_G^2 + \sigma_y^2 + C_2} \tag{2-6}$$

$$s_j(G(x),y) = \frac{\sigma_{Gy} + C_3}{\sigma_G\sigma_y + C_3} \tag{2-7}$$

The luminance comparison is computed only at scale M, by formula (2-8):

$$l_M(G(x),y) = \frac{2\mu_G\mu_y + C_1}{\mu_G^2 + \mu_y^2 + C_1} \tag{2-8}$$

where $\mu_G$ and $\mu_y$ are the means of the generated image and the label image respectively, $\sigma_G$ and $\sigma_y$ their standard deviations, and $\sigma_{Gy}$ their covariance; $\alpha_M$, $\beta_j$, $\gamma_j$ weight the importance of the luminance estimate, contrast, and structural-similarity components at the corresponding scale (to simplify computation, set $\alpha_j = \beta_j = \gamma_j$); $C_1$, $C_2$, $C_3$ are constants introduced to avoid instability, with $C_1 = (K_1 L)^2$, $C_2 = (K_2 L)^2$, $C_3 = C_2/2$, taking parameters $K_1 = 0.01$, $K_2 = 0.03$, and $L$ the dynamic range of pixel values;
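The luminance, contrast, and structure terms of formulas (2-6) to (2-8) can be sketched with global image statistics (real MS-SSIM uses local Gaussian windows and multiple scales, so this is a simplification, not the patent's implementation):

```python
import numpy as np

def ssim_components(a, b, K1=0.01, K2=0.03, L=255.0):
    """Global-statistics versions of the luminance (2-8), contrast (2-6)
    and structure (2-7) comparisons for two same-size images a and b."""
    C1, C2 = (K1 * L) ** 2, (K2 * L) ** 2
    C3 = C2 / 2
    mu_a, mu_b = a.mean(), b.mean()
    sa, sb = a.std(), b.std()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    l = (2 * mu_a * mu_b + C1) / (mu_a ** 2 + mu_b ** 2 + C1)   # luminance
    c = (2 * sa * sb + C2) / (sa ** 2 + sb ** 2 + C2)           # contrast
    s = (cov + C3) / (sa * sb + C3)                             # structure
    return l, c, s
```

For two identical images, all three components equal 1, so the MS-SSIM product is 1 and the loss of formula (2-4) vanishes.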
to further improve the image quality, the gap between frequency domains is introduced as a constraint term, as in formula (2-9):

$$L_{FFL} = \frac{1}{MN}\sum_{u=0}^{M-1}\sum_{v=0}^{N-1} w(u,v)\left|y(u,v) - G(u,v)\right|^2 \tag{2-9}$$

where $M \times N$ is the size of the image, $w(u,v)$ is the spectrum weight matrix, $(u,v)$ are frequency-domain coordinates, and $y(u,v)$ and $G(u,v)$ are the corresponding complex frequency values;
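A sketch of the focal frequency loss of formula (2-9) using a 2-D FFT; the choice of spectrum weight w(u,v) = |Y − G|^α is an assumption borrowed from the focal frequency loss literature, since the patent leaves w(u,v) unspecified here:

```python
import numpy as np

def focal_frequency_loss(gen, target, alpha=1.0):
    """Eq. (2-9) sketch: squared spectrum distance |y(u,v) - G(u,v)|^2,
    weighted by an assumed focal weight w = |y - G|^alpha, averaged over M x N."""
    Y = np.fft.fft2(target)   # complex frequency values y(u, v)
    Gf = np.fft.fft2(gen)     # complex frequency values G(u, v)
    diff = np.abs(Y - Gf)
    w = diff ** alpha         # assumed focal weighting
    return float(np.mean(w * diff ** 2))
```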
the overall optimization objective of the generator is therefore the following equation (2-10):
$$L_{total} = L_{cGAN} + \alpha L_{L1} + \beta L_{MS\text{-}SSIM} + \gamma L_{FFL} \tag{2-10}$$

where α, β and γ are the weights of the respective loss terms, set according to the needs of actual training; the larger the weight of a constraint term, the greater its influence on the total objective loss. The objective function is then continuously and iteratively optimized by gradient descent to obtain the optimal solution.
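Combining the four terms per formula (2-10) is a plain weighted sum; the default weights below are illustrative only, since the patent says to set them per training run:

```python
def total_generator_loss(l_cgan, l_l1, l_ms_ssim, l_ffl,
                         alpha=100.0, beta=1.0, gamma=1.0):
    """Eq. (2-10): weighted sum of the four generator objectives.
    alpha, beta, gamma are tuning knobs; these defaults are assumptions."""
    return l_cgan + alpha * l_l1 + beta * l_ms_ssim + gamma * l_ffl
```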
The embodiment of the invention correspondingly provides an automatic photographing control system suitable for a photographing cube, wherein the photographing cube comprises a control processor and a camera in communication with the control processor, and the automatic photographing control system comprises:
the gesture detection module, used for performing gesture detection on the first N frames of images of the video stream acquired by the camera based on a YOLO gesture detection model to obtain a hand area, where N ≥ 3;
the hand tracking module is used for modeling the target motion state of the hand area by using Kalman filtering, tracking and predicting the position of the hand area in the next frame in real time and obtaining the gesture of the object to be recognized;
the gesture recognition module, used for comparing the gesture of the object to be recognized with the reference gestures in a preset gesture template library to calculate a similarity; when the calculated similarity is larger than a set threshold, the gesture of the object to be recognized is judged to be the photographing trigger gesture and the photographing countdown is started; otherwise, control returns to the gesture detection module;
the shooting control module is used for controlling the camera to shoot when the shooting countdown is finished so as to obtain an original image;
and the image beautification processing module, used for performing face-beautification processing on the original image using the constructed generative adversarial network to obtain a processed target image and outputting it.
Preferably, in the gesture detection module, the gesture detection model includes an input unit, an encoder and a decoder, the input unit is configured to input the first N frames of images of the video stream acquired by the camera, and the encoder is configured to perform feature extraction and feature fusion on the first N frames of images to obtain a feature map with richer information; the decoder decodes the feature map into a detection result, thereby obtaining a hand region;
optimizing a loss function L of the following formula (1-1) based on a preset gesture detection training data set, and continuously performing iterative training by a gradient descent method to obtain an optimal detection result:

$$
\begin{aligned}
L ={}& \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \\
&+\lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2+\left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right] \\
&+\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left(C_i-\hat{C}_i\right)^2
+\lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}\left(C_i-\hat{C}_i\right)^2 \\
&+\sum_{i=0}^{S^2}\mathbb{1}_{i}^{obj}\sum_{c\in classes}\left(p_i(c)-\hat{p}_i(c)\right)^2
\end{aligned}
\tag{1-1}
$$

where $S^2$ represents the size of the output-layer feature map, i.e. the number of grid cells into which the input image is divided; $B$ is the number of Anchors per grid cell; $\mathbb{1}_{ij}^{obj}$ indicates that the $j$-th Anchor in the $i$-th cell is responsible for predicting this object (the Anchor with the largest IoU with the ground-truth box is selected as the prediction box), and $\mathbb{1}_{ij}^{noobj}$ indicates that it is not responsible; $x_i, y_i, w_i, h_i$ and $\hat{x}_i, \hat{y}_i, \hat{w}_i, \hat{h}_i$ are the center coordinates and the width and height of each ground-truth label and of the prediction box respectively, all normalized to between 0 and 1, with $x_i, y_i$ and $\hat{x}_i, \hat{y}_i$ given as offsets relative to their cell; $C_i$ and $\hat{C}_i$ are the ground-truth box confidence and predicted box confidence respectively; $p_i(c)$ and $\hat{p}_i(c)$ are the true and predicted class probabilities of the corresponding cell; $classes$ is the number of object categories; $\lambda_{coord}$ and $\lambda_{noobj}$ are the loss weights for bounding-box coordinate prediction and for the confidence prediction of boxes containing no object, respectively.
In the hand tracking module, the Kalman filtering comprises a prediction process and an update process: the prediction estimates the current state from the state at the previous time step, and the update corrects the prediction using the observation, yielding the optimal state estimate. Specifically, it comprises the following formulas (1-2) to (1-6).

Prediction process:

$$\hat{x}_t^- = A\hat{x}_{t-1} + Bu_t \tag{1-2}$$
$$\Sigma_t^- = A\Sigma_{t-1}A^{\mathsf{T}} + Q \tag{1-3}$$

Update process:

$$K_t = \Sigma_t^- H^{\mathsf{T}}\left(H\Sigma_t^- H^{\mathsf{T}} + R\right)^{-1} \tag{1-4}$$
$$\hat{x}_t = \hat{x}_t^- + K_t\left(z_t - H\hat{x}_t^-\right) \tag{1-5}$$
$$\Sigma_t = (I - K_t H)\Sigma_t^- \tag{1-6}$$

where $\hat{x}_t^-$ is the prediction estimate derived from the previous state; $\hat{x}_t$ is the updated optimal estimate of $\hat{x}_t^-$; $A$ is the state transition matrix; $H$ is the observation matrix; $B$ is the input control matrix; $u_t$ is the external control quantity of the system; $z_t$ is the observation at time $t$; $Q$ and $R$ are the dynamic (process) noise covariance matrix and the measurement noise covariance matrix respectively; $K_t$ is the Kalman gain at time $t$; $\Sigma_t^-$ is the prediction error covariance matrix; and $\Sigma_t$ is the filtering error covariance matrix.
according to the process, initializing the state of the tracker by using a hand region bounding box obtained by detecting the first N frames of images of the video stream acquired by the camera, and then continuously predicting the position of the estimated target in the next frame to realize real-time tracking;
In the gesture recognition module, the picture to be recognized containing the gesture of the object to be recognized and the reference gesture pictures in the gesture template library are first scaled to the same size; the Euclidean distance is then computed according to the following formula (1-7), and the reference gesture picture with the minimum Euclidean distance is taken as the reference gesture picture with the highest similarity:

$$t^* = \arg\min_{t \in T}\left\|x - t\right\|_2 \tag{1-7}$$

where $x$ is the picture to be recognized, $T$ is the set of reference gesture pictures in the gesture template library, $t$ is an element of the set $T$, and $t^*$ is the element with the minimum Euclidean distance.

If the Euclidean distance of the found $t^*$ is less than a preset threshold, the similarity between $t^*$ and the picture to be recognized is deemed greater than the set threshold, and the gesture of the object to be recognized is judged to be the photographing trigger gesture.
Preferably, in the image beautification processing module, the generative adversarial network comprises a generator and a discriminator, the generator being implemented as shown in the following formula (2-1),

G(X)=F(X)+X (2-1)

where X is the input original image and F(X) is the output obtained after the original image X is processed by the feature learning module F. The feature learning module F extracts features mainly through a sparsely connected parallel structure: the input original image X is first reduced to 1/4 of its original size by convolution and pooling; the resolution is then kept unchanged on the main path, while the branch runs at 1/2 the resolution of the main branch to learn high-level semantic features; finally, the high-level semantic features are fused with the structurally rich main branch and output.

The discriminator D takes X and G(X) as input; after feature extraction through multi-layer convolution, a feature map with a large receptive field and small resolution is obtained, and the mean of this feature map is used as the discriminator's score;
in the whole model training process, D adopts binary cross-entropy as its loss function. To better generate the required target image, the generator's loss objective adds three regularization terms on top of the GAN loss: an L1 loss measuring numerical distance, a perceptual loss measuring multi-scale structural similarity, and a focal frequency loss measuring frequency-domain similarity. Specifically:
the target loss function of the conditional GAN can be expressed as shown in equation (2-2),
$$L_{cGAN}(G,D) = \mathbb{E}_{x,y}\left[\log D(x,y)\right] + \mathbb{E}_{x}\left[\log\left(1 - D(x, G(x))\right)\right] \tag{2-2}$$

where G tries to minimize this objective while D tries to maximize it, i.e. $\arg\min_G\max_D L_{cGAN}(G,D)$; $x$ is the input original image and $y$ is the target image;
in order to ensure the consistency of input and output, an L1 loss is introduced as a constraint, as in formula (2-3):

$$L_{L1}(G) = \mathbb{E}_{x,y}\left[\left\|y - G(x)\right\|_1\right] \tag{2-3}$$
in order to ensure that the generated image does not lose structural information, a measure of structural similarity is introduced as in equation (2-4),
$$L_{MS\text{-}SSIM} = 1 - \mathrm{MS\_SSIM}(G(x), y) \tag{2-4}$$

where

$$\mathrm{MS\_SSIM}(G(x),y) = \left[l_M(G(x),y)\right]^{\alpha_M}\prod_{j=1}^{M}\left[c_j(G(x),y)\right]^{\beta_j}\left[s_j(G(x),y)\right]^{\gamma_j} \tag{2-5}$$

The MS_SSIM method takes the reference image and the original image signal as input, and repeatedly applies a low-pass filter followed by downsampling by a factor of 2. Taking the scale of the original image as 1, the highest scale M is obtained after M-1 iterations. At the j-th scale, the contrast comparison and structural similarity, denoted $c_j(G(x),y)$ and $s_j(G(x),y)$ respectively, are computed by formulas (2-6) and (2-7):

$$c_j(G(x),y) = \frac{2\sigma_G\sigma_y + C_2}{\sigma_G^2 + \sigma_y^2 + C_2} \tag{2-6}$$

$$s_j(G(x),y) = \frac{\sigma_{Gy} + C_3}{\sigma_G\sigma_y + C_3} \tag{2-7}$$

The luminance comparison is computed only at scale M, by formula (2-8):

$$l_M(G(x),y) = \frac{2\mu_G\mu_y + C_1}{\mu_G^2 + \mu_y^2 + C_1} \tag{2-8}$$

where $\mu_G$ and $\mu_y$ are the means of the generated image and the label image respectively, $\sigma_G$ and $\sigma_y$ their standard deviations, and $\sigma_{Gy}$ their covariance; $\alpha_M$, $\beta_j$, $\gamma_j$ weight the importance of the luminance estimate, contrast, and structural-similarity components at the corresponding scale (to simplify computation, set $\alpha_j = \beta_j = \gamma_j$); $C_1$, $C_2$, $C_3$ are constants introduced to avoid instability, with $C_1 = (K_1 L)^2$, $C_2 = (K_2 L)^2$, $C_3 = C_2/2$, taking parameters $K_1 = 0.01$, $K_2 = 0.03$, and $L$ the dynamic range of pixel values;
to further improve the image quality, the gap between frequency domains is introduced as a constraint term, as in formula (2-9):

$$L_{FFL} = \frac{1}{MN}\sum_{u=0}^{M-1}\sum_{v=0}^{N-1} w(u,v)\left|y(u,v) - G(u,v)\right|^2 \tag{2-9}$$

where $M \times N$ is the size of the image, $w(u,v)$ is the spectrum weight matrix, $(u,v)$ are frequency-domain coordinates, and $y(u,v)$ and $G(u,v)$ are the corresponding complex frequency values;
the overall optimization objective of the generator is therefore the following equation (2-10):
$$L_{total} = L_{cGAN} + \alpha L_{L1} + \beta L_{MS\text{-}SSIM} + \gamma L_{FFL} \tag{2-10}$$

where α, β and γ are the weights of the respective loss terms, set according to the needs of actual training; the larger the weight of a constraint term, the greater its influence on the total objective loss. The objective function is then continuously and iteratively optimized by gradient descent to obtain the optimal solution.
The embodiment of the invention also provides a photographing cube, comprising a control processor and a camera in communication with the control processor. The camera is used for acquiring a video stream of the object to be photographed inside the photographing cube and transmitting it to the control processor; the control processor is used for controlling the camera to take a photograph according to the received video stream, processing the captured image, and outputting it. The control processor is configured to execute the automatic photographing control method for the photographing cube described in any of the above embodiments.
Compared with the prior art, the automatic photographing control method and system suitable for the photographing cube, and the photographing cube itself, provided by the embodiments of the invention effectively simplify automatic photographing with the cube: the photographer can trigger the camera to complete the shot through a gesture, without touching any physical button, and the image captured by the camera is processed with portrait skin beautification so that the skin looks more textured, healthy and beautiful. This improves the image quality of the captured pictures and effectively improves the user experience.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings required to be used in the embodiments will be briefly described below, and obviously, the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a schematic flow chart of an automatic photo taking control method suitable for a cube taking according to an embodiment of the present invention.
Fig. 2 is a schematic structural diagram of a gesture detection model involved in an automatic photo-taking control method suitable for a cube taking device according to an embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a generator of a generative confrontation network involved in an automatic photograph control method suitable for a cube.
Fig. 4 is a schematic structural diagram of a feature learning module of a generative confrontation network involved in an automatic photo-taking control method for a cube according to an embodiment of the present invention.
Fig. 5 is a schematic structural diagram of an arbiter of a generative confrontation network involved in an automatic photograph control method for a cube according to an embodiment of the present invention.
Fig. 6 is a schematic structural diagram of an automatic photographing control system suitable for a cube.
Fig. 7 is a schematic structural diagram of a photographing cube according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it is to be understood that the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", and the like, indicate orientations and positional relationships based on those shown in the drawings, and are used only for convenience of description and simplicity of description, and do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be considered as limiting the present invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means at least two, e.g. two, three, etc., unless specifically limited otherwise.
In the present invention, unless otherwise expressly stated or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally formed; may be mechanically, electrically or otherwise in communication with each other; either directly or indirectly through intervening media, either internally or in any other relationship. The specific meanings of the above terms in the present invention can be understood according to specific situations by those of ordinary skill in the art.
Referring to fig. 1, an embodiment of the present invention provides an automatic photographing control method suitable for a photographing cube, where the photographing cube includes a control processor and a camera in communication with the control processor, and the automatic photographing control method includes:
s1, performing gesture detection on the first N frames of images of a video stream acquired by a camera based on a gesture detection model of YOLO to obtain a hand area; n is more than or equal to 3;
s2, modeling the target motion state of the hand area by using Kalman filtering, and tracking and predicting the position of the hand area in the next frame in real time to obtain a gesture of an object to be recognized;
s3, comparing the gesture of the object to be recognized with a reference gesture in a preset gesture template library to calculate similarity; when the calculated similarity is larger than a set threshold value, judging the gesture of the object to be recognized as a photographing triggering gesture, and starting photographing countdown; otherwise, returning to the step S1;
s4, when the photographing countdown is finished, controlling the camera to photograph to obtain an original image;
and S5, performing face beautification processing on the original image by using the constructed generative adversarial network to obtain and output a processed target image.
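As an illustration only, the S1 to S5 flow can be sketched as a control loop. The helper callables (`detect_hand`, `track_hand`, `match_gesture`, `beautify`) and the `camera` interface are hypothetical placeholders for the models described below, not APIs from this disclosure:

```python
import time

def auto_photo_loop(camera, detect_hand, track_hand, match_gesture,
                    beautify, threshold, countdown_s=3, n_init=3):
    """Sketch of steps S1-S5: detect, track, match, count down, shoot, beautify."""
    while True:
        # S1: YOLO-based gesture detection on the first N (>= 3) frames
        frames = [camera.read() for _ in range(n_init)]
        boxes = [detect_hand(f) for f in frames]
        if not all(boxes):
            continue                      # no hand region found, keep looking
        # S2: Kalman tracking predicts the hand box position in the next frame
        gesture_img = track_hand(boxes, camera.read())
        # S3: compare against the reference gestures in the template library
        if match_gesture(gesture_img) <= threshold:
            continue                      # similarity too low: back to S1
        time.sleep(countdown_s)           # photographing countdown
        raw = camera.shoot()              # S4: capture the original image
        return beautify(raw)              # S5: GAN-based face beautification
```

The loop returns only once a gesture passes the similarity threshold, mirroring the "otherwise, return to step S1" branch of S3.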
Specifically, as shown in fig. 2, in step S1, the gesture detection model includes an input unit 21, an encoder 22 and a decoder 23. The input unit 21 is configured to input the first N frames of images of the video stream acquired by the camera; the encoder 22 is configured to perform feature extraction and feature fusion on the first N frames of images to obtain a feature map with richer information; and the decoder 23 decodes the feature map into a detection result, thereby obtaining the hand region.
The feature extraction backbone network comprises MobileNetV2 blocks at the 2x, 4x and 6x stages; feature fusion adopts an FPN to fuse the feature maps; and the decoding part performs YOLO regression and classification.
Specifically, a loss function L of the following formula (1-1) is optimized based on a preset gesture detection training data set, and an optimal detection result is obtained through continuous iterative training by a gradient descent method:
$$
\begin{aligned}
L ={} & \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\Big[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\Big] \\
&+\lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\Big[(\sqrt{w_i}-\sqrt{\hat{w}_i})^2+(\sqrt{h_i}-\sqrt{\hat{h}_i})^2\Big] \\
&+\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}(C_i-\hat{C}_i)^2
+\lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}(C_i-\hat{C}_i)^2 \\
&+\sum_{i=0}^{S^2}\mathbb{1}_{i}^{obj}\sum_{c\in classes}\big(p_i(c)-\hat{p}_i(c)\big)^2 \qquad (1\text{-}1)
\end{aligned}
$$
wherein $S^2$ represents the size of the output-layer feature map, i.e. the number of grid cells the input image is divided into; B is the number of Anchors per grid cell; $\mathbb{1}_{ij}^{obj}$ indicates whether the j-th Anchor in the i-th grid cell is responsible for the object prediction, the Anchor with the largest IoU with the ground-truth box being selected as the prediction box; $\mathbb{1}_{ij}^{noobj}$ indicates that the j-th Anchor in the i-th grid cell is not responsible for the object prediction; $x_i, y_i, w_i, h_i$ and $\hat{x}_i, \hat{y}_i, \hat{w}_i, \hat{h}_i$ respectively represent the center coordinates and the width and height of the ground-truth label and of the prediction box, normalized to between 0 and 1, with $x_i, y_i$ being offsets relative to the grid cell; $C_i$ and $\hat{C}_i$ respectively represent the confidence of the ground-truth box and of the prediction box; $p_i(c)$ and $\hat{p}_i(c)$ represent the true and predicted class probabilities of the corresponding cell; classes represents the number of object categories; $\lambda_{coord}$ and $\lambda_{noobj}$ respectively represent the loss weight of the bounding-box coordinate prediction and the confidence-prediction loss weight for bounding boxes that do not contain an object.
It can be understood that the preset gesture detection training data set contains about 20,000 samples, and the categories include "palm", "fist", "thumb", "one-hand heart", "scissor hand" and "others". In the gesture detection process the type of the gesture is not of concern; mainly the localization function is utilized. The objective function is optimized on this data set, and the optimal solution is obtained by continuous iterative training with the gradient descent method.
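Since the loss above assigns each ground-truth box to the Anchor with the largest IoU, that assignment step can be illustrated in isolation. A minimal sketch, assuming boxes in (center-x, center-y, width, height) format; the function names are illustrative, not from the disclosure:

```python
def iou(a, b):
    """Intersection over union of two boxes given as (cx, cy, w, h)."""
    ax1, ay1, ax2, ay2 = a[0]-a[2]/2, a[1]-a[3]/2, a[0]+a[2]/2, a[1]+a[3]/2
    bx1, by1, bx2, by2 = b[0]-b[2]/2, b[1]-b[3]/2, b[0]+b[2]/2, b[1]+b[3]/2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))   # overlap width
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))   # overlap height
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def responsible_anchor(anchors, gt_box):
    """Index j of the Anchor with the largest IoU against the ground-truth box,
    i.e. the one whose indicator term is active in the localization loss."""
    return max(range(len(anchors)), key=lambda j: iou(anchors[j], gt_box))
```

For example, an Anchor identical to the ground truth (IoU = 1) wins over a half-sized one centered at the same point (IoU = 0.25).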
Further, in step S2, the kalman filtering includes a prediction process and an update process, the prediction process estimates a current state according to a state at a previous time, and the update process corrects prediction information according to an observation value, so as to estimate an optimal state; the method specifically comprises the following formulas (1-2) to (1-6):
Prediction process:
$$\hat{\mu}_t = A\mu_{t-1} + Bu_t \qquad (1\text{-}2)$$
$$\hat{\Sigma}_t = A\Sigma_{t-1}A^{T} + Q \qquad (1\text{-}3)$$
Update process:
$$K_t = \hat{\Sigma}_t H^{T}\big(H\hat{\Sigma}_t H^{T} + R\big)^{-1} \qquad (1\text{-}4)$$
$$\mu_t = \hat{\mu}_t + K_t\big(z_t - H\hat{\mu}_t\big) \qquad (1\text{-}5)$$
$$\Sigma_t = (I - K_t H)\hat{\Sigma}_t \qquad (1\text{-}6)$$
wherein $\hat{\mu}_t$ represents the prediction estimate derived from the previous state; $\mu_t$ represents the optimal estimate after updating $\hat{\mu}_t$ with the observation $z_t$; A represents the state transition matrix; H represents the observation matrix; B represents the input control matrix; $u_t$ represents the external control quantity of the system; Q and R respectively represent the dynamic-noise covariance matrix and the measurement-noise covariance matrix; $K_t$ represents the Kalman gain at time t; $\hat{\Sigma}_t$ represents the prediction-error covariance matrix; $\Sigma_t$ represents the filtering-error covariance matrix;
according to the process, the state of the tracker is initialized by utilizing a hand region bounding box obtained by detecting the first N frames of images of the video stream acquired by the camera, and then the position of the estimated target in the next frame is continuously predicted, so that real-time tracking is realized.
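The prediction and update formulas above can be sketched directly with NumPy. The state layout (e.g. bounding-box center plus velocity) and all matrix values in the usage below are illustrative assumptions, not taken from the disclosure:

```python
import numpy as np

def kalman_predict(mu, Sigma, A, Q, B=None, u=None):
    """Prediction step, formulas (1-2) and (1-3): propagate state and covariance."""
    mu_pred = A @ mu + (B @ u if B is not None else 0)
    Sigma_pred = A @ Sigma @ A.T + Q
    return mu_pred, Sigma_pred

def kalman_update(mu_pred, Sigma_pred, z, H, R):
    """Update step, formulas (1-4) to (1-6): correct the prediction with observation z."""
    S = H @ Sigma_pred @ H.T + R
    K = Sigma_pred @ H.T @ np.linalg.inv(S)              # Kalman gain K_t
    mu = mu_pred + K @ (z - H @ mu_pred)                 # optimal state estimate
    Sigma = (np.eye(len(mu_pred)) - K @ H) @ Sigma_pred  # filtering error covariance
    return mu, Sigma
```

With a constant-velocity model (state = position and velocity, only position observed), repeating predict/update per frame tracks the hand box center as described.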
Further, in step S3, the to-be-recognized picture with the to-be-recognized object gesture and the reference gesture picture in the gesture template library are first scaled to the same size, then the euclidean distance is calculated according to the following formula (1-7), and the reference gesture picture with the minimum euclidean distance is found out as the reference gesture picture with the highest similarity:
$$t^{*} = \arg\min_{t\in T}\ \lVert x - t\rVert_{2} \qquad (1\text{-}7)$$
wherein x is the picture to be recognized, T is the set of reference gesture pictures in the gesture template library, t is an element of the set T, and $t^{*}$ is the element with the minimum Euclidean distance to x;
If the Euclidean distance of the found $t^{*}$ is less than a preset threshold, it is determined that the similarity between $t^{*}$ and the gesture of the object to be recognized is greater than the set threshold, and the gesture of the object to be recognized is therefore determined to be the photographing trigger gesture.
The process of matching the gesture template can be described as solving the distance from a point to each element of a set; the best match is the element of the set with the minimum distance to the point, denoted $t^{*}$.
It can be seen that a template library is first established by collecting gesture pictures of several types taken from a plurality of different angles; the types include "palm", "fist", "thumb", "one-hand heart", "scissor hand" and "other", i.e. 6 classes in total. The picture to be recognized and the template pictures are then scaled to the same size, the Euclidean distance is calculated pixel-wise according to formula (1-7), and the template picture with the highest similarity, i.e. the template with the smallest distance, is found. If the distance is less than the set threshold, the two pictures are very close, and the category of $t^{*}$ is therefore taken as the recognition result of the picture to be detected.
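The template-matching rule of formula (1-7), including the distance threshold check, can be sketched as follows; the category labels and threshold values in the usage are illustrative:

```python
import numpy as np

def match_template(x, templates, labels, threshold):
    """Find t* = argmin_{t in T} ||x - t||_2 (formula (1-7)) and return its
    label when the distance is below the threshold, else None.
    All images are assumed already scaled to the same size."""
    dists = [np.linalg.norm(x.astype(float) - t.astype(float)) for t in templates]
    best = int(np.argmin(dists))          # index of the nearest template t*
    return labels[best] if dists[best] < threshold else None
```

Returning `None` corresponds to the "otherwise, return to step S1" branch: no template is close enough to count as a trigger gesture.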
Further, referring to fig. 3 to 5, in step S5, the constructed generative adversarial network is used to perform face beautification on the original image to obtain and output a processed target image. The face beautification mainly removes facial blemishes and performs skin smoothing, so that the skin looks more textured, healthy and beautiful, improving the image quality of the captured photos.
Specifically, the generative adversarial network includes a generator and a discriminator, and fig. 4 is a detailed diagram of the function F in fig. 3. Considering that the output image G(X) differs from the input image X only by slight changes in characteristics such as texture and color, the generator adopts a residual structure to ensure the quality of the generated image, and learning the residual F(X) makes it easier for the model to converge.
The generator is implemented as shown in the following formula (2-1),
G(X)=F(X)+X (2-1)
wherein X is the input original image and F(X) is the output obtained after the original image X is processed by the feature learning module F. The feature learning module F mainly adopts a sparsely connected parallel structure to extract features: the input original image X is first reduced to 1/4 of its original size by convolution and pooling; the resolution is then kept unchanged on the main path, while the resolution on the branch is 1/2 of that of the main branch and learns high-level semantic features; finally, the high-level semantic features are mixed with the main branch, which is rich in structural information, and output.
The inputs of the discriminator D are X and G(X); after features are extracted through multilayer convolution, a feature map with a large receptive field and a small resolution is obtained, and the average value of this feature map is used as the score of the discriminator.
In the whole model training process, D adopts binary cross entropy as the loss function. To better generate the required target image, the loss objective of the generator adds three regularization terms on the basis of the GAN loss: an L1 loss measuring numerical distance, a multi-scale structural similarity (MS-SSIM) perceptual loss, and a Focal Frequency Loss (FFL) measuring frequency-domain similarity, as follows:
the target loss function of the conditional GAN can be expressed as shown in equation (2-2),
L cGAN (G,D)=E x,y [logD(x,y)]+E x [log(1-D(x,G(x)))] (2-2)
where G attempts to minimize this objective while D attempts to maximize it, i.e. $G^{*}=\arg\min_G\max_D L_{cGAN}(G,D)$; x is the input original image and y is the target image;
in order to ensure the consistency of input and output, L1 loss is introduced to carry out constraint such as formula (2-3),
L L1 (G)=E x,y [||y-G(x)|| 1 ] (2-3)
in order to prevent the generated image from losing the structural information, a measure of structural similarity is introduced as in equation (2-4),
L MS-SSIM =1-MS_SSIM(G(x),y) (2-4)
wherein
$$MS\_SSIM(G(x),y)=\big[l_M(G(x),y)\big]^{\alpha_M}\prod_{j=1}^{M}\big[c_j(G(x),y)\big]^{\beta_j}\big[s_j(G(x),y)\big]^{\gamma_j} \qquad (2\text{-}5)$$
The MS_SSIM method takes the reference image and the original image signal as input and repeatedly applies a low-pass filter and downsamples the image by a factor of 2; assuming the original image scale is 1, the highest scale M is obtained after M-1 iterations. At the j-th scale, the contrast comparison and the structural similarity, denoted $c_j(G(x),y)$ and $s_j(G(x),y)$ respectively, are calculated by the following formulas (2-6) and (2-7):
$$c_j(G(x),y)=\frac{2\sigma_G\sigma_y+C_2}{\sigma_G^{2}+\sigma_y^{2}+C_2} \qquad (2\text{-}6)$$
$$s_j(G(x),y)=\frac{\sigma_{Gy}+C_3}{\sigma_G\sigma_y+C_3} \qquad (2\text{-}7)$$
The luminance comparison is calculated by formula (2-8) only at the scale M:
$$l_M(G(x),y)=\frac{2\mu_G\mu_y+C_1}{\mu_G^{2}+\mu_y^{2}+C_1} \qquad (2\text{-}8)$$
wherein $\mu_G$ and $\mu_y$ represent the means of the generated image and the label image respectively, $\sigma_G$ and $\sigma_y$ the standard deviations, and $\sigma_{Gy}$ the covariance; $\alpha_M$, $\beta_j$ and $\gamma_j$ weight the importance of the luminance, contrast and structural-similarity components at the corresponding scales; to simplify the computation, let $\alpha_j=\beta_j=\gamma_j$; C1, C2 and C3 are constants that avoid instability, with $C_1=(K_1L)^{2}$, $C_2=(K_2L)^{2}$, $C_3=C_2/2$, taking the parameters $K_1=0.01$, $K_2=0.03$, where L is the dynamic range of the pixel values;
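The luminance, contrast and structure terms of formulas (2-6) to (2-8) can be sketched as below. Note this is a simplification: real (MS-)SSIM computes the statistics over local windows and multiple scales, whereas this sketch uses whole-image statistics at a single scale for clarity:

```python
import numpy as np

def ssim_components(x, y, L=255.0, K1=0.01, K2=0.03):
    """Global luminance l, contrast c and structure s terms of SSIM,
    following formulas (2-6) to (2-8) with C1=(K1*L)^2, C2=(K2*L)^2, C3=C2/2."""
    C1, C2 = (K1 * L) ** 2, (K2 * L) ** 2
    C3 = C2 / 2
    mx, my = x.mean(), y.mean()                       # means mu
    sx, sy = x.std(), y.std()                         # standard deviations sigma
    sxy = ((x - mx) * (y - my)).mean()                # covariance sigma_xy
    l = (2 * mx * my + C1) / (mx ** 2 + my ** 2 + C1)
    c = (2 * sx * sy + C2) / (sx ** 2 + sy ** 2 + C2)
    s = (sxy + C3) / (sx * sy + C3)
    return l, c, s
```

For identical images all three terms equal 1, so the MS-SSIM loss $1-MS\_SSIM$ vanishes, which is the behavior the structural constraint relies on.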
To further improve the image quality, the gap between the frequency domains is introduced as a constraint term, as in formula (2-9):
$$L_{FFL}=\frac{1}{MN}\sum_{u=0}^{M-1}\sum_{v=0}^{N-1} w(u,v)\,\big|F_y(u,v)-F_G(u,v)\big|^{2} \qquad (2\text{-}9)$$
wherein M×N represents the size of the image, w(u,v) is the spectrum weight matrix, (u,v) are the frequency-domain coordinates, and $F_y(u,v)$ and $F_G(u,v)$ are the complex frequency values of the target image y and the generated image G(x);
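A minimal sketch of the frequency-domain constraint of formula (2-9). It assumes, as in the dynamic variant of the Focal Frequency Loss, that the spectrum weight w(u,v) is the magnitude of the per-frequency error itself; that choice is an assumption of this sketch, not specified above:

```python
import numpy as np

def focal_frequency_loss(gen, target):
    """Average spectrum-weighted squared distance between the 2-D FFTs of the
    generated and target images, following formula (2-9)."""
    Fg = np.fft.fft2(gen)                 # complex frequency values F_G(u, v)
    Fy = np.fft.fft2(target)              # complex frequency values F_y(u, v)
    diff = np.abs(Fy - Fg) ** 2           # per-frequency squared error
    w = np.sqrt(diff)                     # dynamic spectrum weight matrix w(u, v)
    return (w * diff).mean()              # mean over the M x N frequency grid
```

The weighting emphasizes the frequencies where the generated image deviates most, pushing the generator to recover hard frequency components first.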
the overall optimization objective of the generator is therefore the following equation (2-10):
L total =L cGAN +αL L1 +βL MS-SSIM +γL FFL (2-10)
wherein α, β and γ represent the weights of the respective loss terms and are set according to the needs of actual training; the larger the weight of a constraint term, the larger its influence on the total target loss. The objective function is then continuously and iteratively optimized by the gradient descent method to obtain the optimal solution.
Referring to fig. 6, an embodiment of the present invention provides an automatic photographing control system suitable for a photographing cube, where the photographing cube includes a control processor and a camera in communication with the control processor, and the automatic photographing control system includes:
the gesture detection module 61 is configured to perform gesture detection on the first N frames of images of the video stream acquired by the camera based on a gesture detection model of the YOLO to obtain a hand region; n is more than or equal to 3;
the hand tracking module 62 is used for modeling the target motion state of the hand region by using kalman filtering, tracking and predicting the position of the hand region in the next frame in real time, and obtaining the gesture of the object to be recognized;
the gesture recognition module 63 is configured to compare the gesture of the object to be recognized with a reference gesture in a preset gesture template library to calculate a similarity; when the calculated similarity is larger than a set threshold value, judging the gesture of the object to be recognized as a photographing triggering gesture, and starting photographing countdown; otherwise, returning to the step S1;
the photographing control module 64 is configured to control the camera to photograph when the photographing countdown is finished, so as to obtain an original image;
and the image beautification processing module 65 is configured to perform face beautification processing on the original image by using the constructed generative adversarial network to obtain and output a processed target image.
Specifically, in the gesture detection module 61, as shown in fig. 2, the gesture detection model includes an input unit 21, an encoder 22 and a decoder 23, the input unit 21 is configured to input the first N frames of images of the video stream acquired by the camera, and the encoder 22 is configured to perform feature extraction and feature fusion on the first N frames of images to obtain a feature map with richer information; the decoder 23 decodes the feature map into a detection result, thereby obtaining a hand region;
optimizing a loss function L of the following formula (1-1) based on a preset gesture detection training data set, and continuously performing iterative training by a gradient descent method to obtain an optimal detection result:
$$
\begin{aligned}
L ={} & \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\Big[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\Big] \\
&+\lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\Big[(\sqrt{w_i}-\sqrt{\hat{w}_i})^2+(\sqrt{h_i}-\sqrt{\hat{h}_i})^2\Big] \\
&+\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}(C_i-\hat{C}_i)^2
+\lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}(C_i-\hat{C}_i)^2 \\
&+\sum_{i=0}^{S^2}\mathbb{1}_{i}^{obj}\sum_{c\in classes}\big(p_i(c)-\hat{p}_i(c)\big)^2 \qquad (1\text{-}1)
\end{aligned}
$$
wherein $S^2$ represents the size of the output-layer feature map, i.e. the number of grid cells the input image is divided into; B is the number of Anchors per grid cell; $\mathbb{1}_{ij}^{obj}$ indicates whether the j-th Anchor in the i-th grid cell is responsible for the object prediction, the Anchor with the largest IoU with the ground-truth box being selected as the prediction box; $\mathbb{1}_{ij}^{noobj}$ indicates that the j-th Anchor in the i-th grid cell is not responsible for the object prediction; $x_i, y_i, w_i, h_i$ and $\hat{x}_i, \hat{y}_i, \hat{w}_i, \hat{h}_i$ respectively represent the center coordinates and the width and height of the ground-truth label and of the prediction box, normalized to between 0 and 1, with $x_i, y_i$ being offsets relative to the grid cell; $C_i$ and $\hat{C}_i$ respectively represent the confidence of the ground-truth box and of the prediction box; $p_i(c)$ and $\hat{p}_i(c)$ represent the true and predicted class probabilities of the corresponding cell; classes represents the number of object categories; $\lambda_{coord}$ and $\lambda_{noobj}$ respectively represent the loss weight of the bounding-box coordinate prediction and the confidence-prediction loss weight for bounding boxes that do not contain an object;
Specifically, in the hand tracking module 62, the Kalman filtering includes a prediction process and an update process: the prediction process estimates the current state from the state at the previous time, and the update process corrects the prediction information according to the observation value, thereby estimating the optimal state; specifically, as in the following formulas (1-2) to (1-6):
Prediction process:
$$\hat{\mu}_t = A\mu_{t-1} + Bu_t \qquad (1\text{-}2)$$
$$\hat{\Sigma}_t = A\Sigma_{t-1}A^{T} + Q \qquad (1\text{-}3)$$
Update process:
$$K_t = \hat{\Sigma}_t H^{T}\big(H\hat{\Sigma}_t H^{T} + R\big)^{-1} \qquad (1\text{-}4)$$
$$\mu_t = \hat{\mu}_t + K_t\big(z_t - H\hat{\mu}_t\big) \qquad (1\text{-}5)$$
$$\Sigma_t = (I - K_t H)\hat{\Sigma}_t \qquad (1\text{-}6)$$
wherein $\hat{\mu}_t$ represents the prediction estimate derived from the previous state; $\mu_t$ represents the optimal estimate after updating $\hat{\mu}_t$ with the observation $z_t$; A represents the state transition matrix; H represents the observation matrix; B represents the input control matrix; $u_t$ represents the external control quantity of the system; Q and R respectively represent the dynamic-noise covariance matrix and the measurement-noise covariance matrix; $K_t$ represents the Kalman gain at time t; $\hat{\Sigma}_t$ represents the prediction-error covariance matrix; $\Sigma_t$ represents the filtering-error covariance matrix;
according to the process, initializing the state of the tracker by using a hand region bounding box obtained by detecting the first N frames of images of the video stream acquired by the camera, and then continuously predicting the position of the estimated target in the next frame to realize real-time tracking;
further, in the gesture recognition module 63, first, the to-be-recognized picture with the gesture of the to-be-recognized object and the reference gesture picture in the gesture template library are scaled to the same size, then, the euclidean distance is calculated according to the following formula (1-7), and the reference gesture picture with the minimum euclidean distance is found out to be the reference gesture picture with the highest similarity:
$$t^{*} = \arg\min_{t\in T}\ \lVert x - t\rVert_{2} \qquad (1\text{-}7)$$
wherein x is the picture to be recognized, T is the set of reference gesture pictures in the gesture template library, t is an element of the set T, and $t^{*}$ is the element with the minimum Euclidean distance to x;
If the Euclidean distance of the found $t^{*}$ is less than a preset threshold, it is determined that the similarity between $t^{*}$ and the gesture of the object to be recognized is greater than the set threshold, and the gesture of the object to be recognized is therefore determined to be the photographing trigger gesture.
Further, in the image beautification processing module 65, the generative adversarial network includes a generator and a discriminator, the generator being implemented as shown in the following formula (2-1),
G(X)=F(X)+X (2-1)
wherein X is the input original image and F(X) is the output obtained after the original image X is processed by the feature learning module F. The feature learning module F mainly adopts a sparsely connected parallel structure to extract features: the input original image X is first reduced to 1/4 of its original size by convolution and pooling; the resolution is then kept unchanged on the main path, while the resolution on the branch is 1/2 of that of the main branch and learns high-level semantic features; finally, the high-level semantic features are mixed with the main branch, which is rich in structural information, and output;
the inputs of the discriminator D are X and G(X); after features are extracted through multilayer convolution, a feature map with a large receptive field and a small resolution is obtained, and the average value of this feature map is used as the score of the discriminator;
in the whole model training process, D adopts binary cross entropy as the loss function. To better generate the required target image, the loss objective of the generator adds three regularization terms on the basis of the GAN loss: an L1 loss measuring numerical distance, a multi-scale structural similarity (MS-SSIM) perceptual loss, and a Focal Frequency Loss (FFL) measuring frequency-domain similarity, as follows:
the target loss function of the conditional GAN can be expressed as shown in equation (2-2),
L cGAN (G,D)=E x,y [logD(x,y)]+E x [log(1-D(x,G(x)))] (2-2)
where G attempts to minimize this objective while D attempts to maximize it, i.e. $G^{*}=\arg\min_G\max_D L_{cGAN}(G,D)$; x is the input original image and y is the target image;
in order to ensure the consistency of input and output, L1 loss is introduced to carry out constraint such as formula (2-3),
L L1 (G)=E x,y [||y-G(x)|| 1 ] (2-3)
in order to prevent the generated image from losing the structural information, a measure of structural similarity is introduced as in equation (2-4),
L MS-SSIM =1-MS_SSIM(G(x),y) (2-4)
wherein
$$MS\_SSIM(G(x),y)=\big[l_M(G(x),y)\big]^{\alpha_M}\prod_{j=1}^{M}\big[c_j(G(x),y)\big]^{\beta_j}\big[s_j(G(x),y)\big]^{\gamma_j} \qquad (2\text{-}5)$$
The MS_SSIM method takes the reference image and the original image signal as input and repeatedly applies a low-pass filter and downsamples the image by a factor of 2; assuming the original image scale is 1, the highest scale M is obtained after M-1 iterations. At the j-th scale, the contrast comparison and the structural similarity, denoted $c_j(G(x),y)$ and $s_j(G(x),y)$ respectively, are calculated by the following formulas (2-6) and (2-7):
$$c_j(G(x),y)=\frac{2\sigma_G\sigma_y+C_2}{\sigma_G^{2}+\sigma_y^{2}+C_2} \qquad (2\text{-}6)$$
$$s_j(G(x),y)=\frac{\sigma_{Gy}+C_3}{\sigma_G\sigma_y+C_3} \qquad (2\text{-}7)$$
The luminance comparison is calculated by formula (2-8) only at the scale M:
$$l_M(G(x),y)=\frac{2\mu_G\mu_y+C_1}{\mu_G^{2}+\mu_y^{2}+C_1} \qquad (2\text{-}8)$$
wherein $\mu_G$ and $\mu_y$ represent the means of the generated image and the label image respectively, $\sigma_G$ and $\sigma_y$ the standard deviations, and $\sigma_{Gy}$ the covariance; $\alpha_M$, $\beta_j$ and $\gamma_j$ weight the importance of the luminance, contrast and structural-similarity components at the corresponding scales; to simplify the computation, let $\alpha_j=\beta_j=\gamma_j$; C1, C2 and C3 are constants that avoid instability, with $C_1=(K_1L)^{2}$, $C_2=(K_2L)^{2}$, $C_3=C_2/2$, taking the parameters $K_1=0.01$, $K_2=0.03$, where L is the dynamic range of the pixel values;
To further improve the image quality, the gap between the frequency domains is introduced as a constraint term, as in formula (2-9):
$$L_{FFL}=\frac{1}{MN}\sum_{u=0}^{M-1}\sum_{v=0}^{N-1} w(u,v)\,\big|F_y(u,v)-F_G(u,v)\big|^{2} \qquad (2\text{-}9)$$
wherein M×N represents the size of the image, w(u,v) is the spectrum weight matrix, (u,v) are the frequency-domain coordinates, and $F_y(u,v)$ and $F_G(u,v)$ are the complex frequency values of the target image y and the generated image G(x);
the overall optimization objective of the generator is therefore the following equation (2-10):
L total =L cGAN +αL L1 +βL MS-SSIM +γL FFL (2-10)
wherein α, β and γ represent the weights of the respective loss terms and are set according to the needs of actual training; the larger the weight of a constraint term, the larger its influence on the total target loss. The objective function is then continuously and iteratively optimized by the gradient descent method to obtain the optimal solution.
Referring to fig. 7, an embodiment of the present invention provides a photographing cube, where the photographing cube includes a control processor 1, and a camera 2 and a camera 3 that are respectively in communication with the control processor 1; the camera 2 is configured to obtain a video stream of an object to be photographed in the photographing cube and transmit the video stream to the control processor 1; the control processor 1 is configured to control the camera 3 to take a photograph according to the received video stream and to process and output the captured image; wherein the control processor 1 is arranged to be able to execute the automatic photographing control method suitable for a photographing cube described in any of the above embodiments.
To sum up, the automatic photographing control method and system suitable for a photographing cube, and the photographing cube itself, can effectively simplify automatic photographing with the cube: a photographer can automatically trigger the camera to complete photographing through gestures without touching any physical button, and the images captured by the camera are processed with facial-skin beautification so that the skin looks more textured, healthy and beautiful, improving the image quality of the captured photos and effectively improving the user experience.
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (7)
1. An automatic photographing control method suitable for a photographing cube, wherein the photographing cube comprises a control processor and a camera in communication with the control processor, the automatic photographing control method comprising the following steps:
s1, performing gesture detection on the first N frames of images of a video stream acquired by a camera based on a gesture detection model of YOLO to obtain a hand area; n is more than or equal to 3;
s2, modeling the target motion state of the hand area by using Kalman filtering, and tracking and predicting the position of the hand area in the next frame in real time to obtain a gesture of an object to be recognized;
s3, comparing the gesture of the object to be recognized with a reference gesture in a preset gesture template library to calculate similarity; when the calculated similarity is larger than a set threshold value, judging the gesture of the object to be recognized as a photographing triggering gesture, and starting photographing countdown; otherwise, returning to the step S1;
s4, when the photographing countdown is finished, controlling the camera to photograph to obtain an original image;
s5, performing face beautification processing on the original image by using the constructed generative adversarial network to obtain and output a processed target image;
in step S5, the generative adversarial network includes a generator and a discriminator, the generator being implemented as shown in the following formula (2-1),
G(X)=F(X)+X (2-1)
wherein X is the input original image and F(X) is the output obtained after the original image X is processed by the feature learning module F. The feature learning module F mainly adopts a sparsely connected parallel structure to extract features: the input original image X is first reduced to 1/4 of its original size by convolution and pooling; the resolution is then kept unchanged on the main path, while the resolution on the branch is 1/2 of that of the main branch and learns high-level semantic features; finally, the high-level semantic features are mixed with the main branch, which is rich in structural information, and output;
the inputs of the discriminator D are X and G(X); after features are extracted through multilayer convolution, a feature map with a large receptive field and a small resolution is obtained, and the average value of this feature map is used as the score of the discriminator;
in the whole model training process, D adopts binary cross entropy as the loss function. To better generate the required target image, the loss objective of the generator adds three regularization terms on the basis of the GAN loss: an L1 loss measuring numerical distance, a multi-scale structural similarity (MS-SSIM) perceptual loss, and a Focal Frequency Loss (FFL) measuring frequency-domain similarity, as follows:
the target loss function of the conditional GAN can be expressed as shown in equation (2-2),
L cGAN (G,D)=E x,y [logD(x,y)]+E x [log(1-D(x,G(x)))] (2-2)
where G attempts to minimize this objective while D attempts to maximize it, i.e. $G^{*}=\arg\min_G\max_D L_{cGAN}(G,D)$; x is the input original image and y is the target image;
in order to ensure the consistency of input and output, L1 loss is introduced to carry out constraint such as formula (2-3),
L L1 (G)=E x,y [||y-G(x)|| 1 ] (2-3)
in order to prevent the generated image from losing the structural information, a measure of structural similarity is introduced as in equation (2-4),
L MS-SSIM =1-MS_SSIM(G(x),y) (2-4)
wherein
$$MS\_SSIM(G(x),y)=\big[l_M(G(x),y)\big]^{\alpha_M}\prod_{j=1}^{M}\big[c_j(G(x),y)\big]^{\beta_j}\big[s_j(G(x),y)\big]^{\gamma_j} \qquad (2\text{-}5)$$
The MS_SSIM method takes the reference image and the original image signal as input and repeatedly applies a low-pass filter and downsamples the image by a factor of 2; assuming the original image scale is 1, the highest scale M is obtained after M-1 iterations. At the j-th scale, the contrast comparison and the structural similarity, denoted $c_j(G(x),y)$ and $s_j(G(x),y)$ respectively, are calculated by the following formulas (2-6) and (2-7):
$$c_j(G(x),y)=\frac{2\sigma_G\sigma_y+C_2}{\sigma_G^{2}+\sigma_y^{2}+C_2} \qquad (2\text{-}6)$$
$$s_j(G(x),y)=\frac{\sigma_{Gy}+C_3}{\sigma_G\sigma_y+C_3} \qquad (2\text{-}7)$$
The luminance comparison is calculated by formula (2-8) only at the scale M:
$$l_M(G(x),y)=\frac{2\mu_G\mu_y+C_1}{\mu_G^{2}+\mu_y^{2}+C_1} \qquad (2\text{-}8)$$
wherein $\mu_G$ and $\mu_y$ represent the means of the generated image and the label image respectively, $\sigma_G$ and $\sigma_y$ the standard deviations, and $\sigma_{Gy}$ the covariance; $\alpha_M$, $\beta_j$ and $\gamma_j$ weight the importance of the luminance, contrast and structural-similarity components at the corresponding scales; to simplify the computation, let $\alpha_j=\beta_j=\gamma_j$; C1, C2 and C3 are constants that avoid instability, with $C_1=(K_1L)^{2}$, $C_2=(K_2L)^{2}$, $C_3=C_2/2$, taking the parameters $K_1=0.01$, $K_2=0.03$, where L is the dynamic range of the pixel values;
To further improve the image quality, the gap between the frequency domains is introduced as a constraint term, as in formula (2-9):
$$L_{FFL}=\frac{1}{MN}\sum_{u=0}^{M-1}\sum_{v=0}^{N-1} w(u,v)\,\big|F_y(u,v)-F_G(u,v)\big|^{2} \qquad (2\text{-}9)$$
wherein M×N represents the size of the image, w(u,v) is the spectrum weight matrix, (u,v) are the frequency-domain coordinates, and $F_y(u,v)$ and $F_G(u,v)$ are the complex frequency values of the target image y and the generated image G(x);
the overall optimization objective of the generator is therefore the following equation (2-10):
L total =L cGAN +αL L1 +βL MS-SSIM +γL FFL (2-10)
wherein α, β and γ represent the weights of the respective loss terms and are set according to the needs of actual training; the larger the weight of a constraint term, the larger its influence on the total target loss. The objective function is then continuously and iteratively optimized by the gradient descent method to obtain the optimal solution.
2. The automatic photography control method for a cube according to claim 1,
in the step S1, the gesture detection model includes an input unit, an encoder and a decoder, the input unit is configured to input the first N frames of images of the video stream acquired by the camera, and the encoder is configured to perform feature extraction and feature fusion on the first N frames of images to obtain a feature map with richer information; the decoder decodes the feature map into a detection result, thereby obtaining a hand region;
optimizing a loss function L of the following formula (1-1) based on a preset gesture detection training data set, and continuously performing iterative training by a gradient descent method to obtain an optimal detection result:
$$
\begin{aligned}
L ={} & \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\Big[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\Big] \\
&+\lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\Big[(\sqrt{w_i}-\sqrt{\hat{w}_i})^2+(\sqrt{h_i}-\sqrt{\hat{h}_i})^2\Big] \\
&+\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}(C_i-\hat{C}_i)^2
+\lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}(C_i-\hat{C}_i)^2 \\
&+\sum_{i=0}^{S^2}\mathbb{1}_{i}^{obj}\sum_{c\in classes}\big(p_i(c)-\hat{p}_i(c)\big)^2 \qquad (1\text{-}1)
\end{aligned}
$$
wherein $S^2$ represents the size of the output-layer feature map, i.e. the number of grid cells the input image is divided into; B is the number of Anchors per grid cell; $\mathbb{1}_{ij}^{obj}$ indicates whether the j-th Anchor in the i-th grid cell is responsible for the object prediction, the Anchor with the largest IoU with the ground-truth box being selected as the prediction box; $\mathbb{1}_{ij}^{noobj}$ indicates that the j-th Anchor in the i-th grid cell is not responsible for the object prediction; $x_i, y_i, w_i, h_i$ and $\hat{x}_i, \hat{y}_i, \hat{w}_i, \hat{h}_i$ respectively represent the center coordinates and the width and height of the ground-truth label and of the prediction box, normalized to between 0 and 1, with $x_i, y_i$ being offsets relative to the grid cell; $C_i$ and $\hat{C}_i$ respectively represent the confidence of the ground-truth box and of the prediction box; $p_i(c)$ and $\hat{p}_i(c)$ represent the true and predicted class probabilities of the corresponding cell; classes represents the number of object categories; $\lambda_{coord}$ and $\lambda_{noobj}$ respectively represent the loss weight of the bounding-box coordinate prediction and the confidence-prediction loss weight for bounding boxes that do not contain an object.
3. The automatic photographing control method for a photographing cube according to claim 1, wherein
in the step S2, the kalman filtering includes a prediction process and an update process, the prediction process estimates a current state according to a state at a previous time, and the update process corrects prediction information according to an observation value, so as to estimate an optimal state; the method specifically comprises the following formulas (1-2) to (1-6):
prediction process:

$$\hat{x}_{t}^{-} = A\hat{x}_{t-1} + Bu_{t} \tag{1-2}$$

$$\Sigma_{t}^{-} = A\Sigma_{t-1}A^{T} + Q \tag{1-3}$$

update process:

$$K_{t} = \Sigma_{t}^{-}H^{T}\left(H\Sigma_{t}^{-}H^{T} + R\right)^{-1} \tag{1-4}$$

$$\hat{x}_{t} = \hat{x}_{t}^{-} + K_{t}\left(z_{t} - H\hat{x}_{t}^{-}\right) \tag{1-5}$$

$$\Sigma_{t} = \left(I - K_{t}H\right)\Sigma_{t}^{-} \tag{1-6}$$

wherein $\hat{x}_{t}^{-}$ represents the prediction estimate derived from the previous state; $\hat{x}_{t}$ represents the optimal estimate obtained by updating $\hat{x}_{t}^{-}$; A represents the state transition matrix; H represents the observation matrix; B represents the input control matrix; $u_{t}$ represents the external control quantity of the system; $z_{t}$ represents the observation at time t; Q and R respectively represent the dynamic noise covariance matrix and the measurement noise covariance matrix; $K_{t}$ represents the Kalman gain at time t; $\Sigma_{t}^{-}$ represents the prediction error covariance matrix; $\Sigma_{t}$ represents the filtering error covariance matrix;
according to the process, the state of the tracker is initialized by utilizing a hand region bounding box obtained by detecting the first N frames of images of the video stream acquired by the camera, and then the position of the estimated target in the next frame is continuously predicted, so that real-time tracking is realized.
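The prediction and update steps described above follow the standard Kalman recursion; a minimal NumPy sketch is given below. The function names are illustrative, and the constant-velocity 1-D state model used in the example is an assumption, not the tracker state of the patent:

```python
import numpy as np

def kalman_predict(x, P, A, B, u, Q):
    # Project the state and its covariance forward one step.
    x_pred = A @ x + B @ u
    P_pred = A @ P @ A.T + Q
    return x_pred, P_pred

def kalman_update(x_pred, P_pred, z, H, R):
    # Fold the observation z into the prediction via the Kalman gain.
    S = H @ P_pred @ H.T + R                      # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)           # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)         # corrected state
    P_new = (np.eye(len(x_pred)) - K @ H) @ P_pred
    return x_new, P_new
```

In a tracker, the state would be initialised from the detected hand bounding box and the predict step run once per frame, with the update applied whenever a fresh detection is available.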
4. The automatic photographing control method for a photographing cube according to claim 1, wherein
in the step S3, the picture to be recognized containing the gesture of the object to be recognized and the reference gesture pictures in the gesture template library are first scaled to the same size, then the Euclidean distance is calculated according to the following formula (1-7), and the reference gesture picture with the minimum Euclidean distance is found as the reference gesture picture with the highest similarity:

$$t^{*} = \arg\min_{t \in T}\left\| x - t \right\|_{2} \tag{1-7}$$

wherein $x$ represents the picture to be recognized, $T$ represents the set of reference gesture pictures in the gesture template library, $t$ represents an element of the set $T$, and $t^{*}$ represents the element with the minimum Euclidean distance;
if the Euclidean distance of the found $t^{*}$ is less than the preset threshold value, it is determined that the similarity between $t^{*}$ and the picture to be recognized is greater than the set threshold value, so that the gesture of the object to be recognized is judged to be the photographing-trigger gesture.
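A minimal sketch of this template-matching step, assuming flattened, equally sized images; the function name, the dictionary of templates and the gesture names are illustrative:

```python
import numpy as np

def match_gesture(x, templates, threshold):
    """x: flattened query image; templates: dict name -> flattened reference
    image, already resized to the same shape as x. Returns (name, distance)
    for the closest template if it is within threshold, else (None, distance)."""
    dists = {name: np.linalg.norm(x - t) for name, t in templates.items()}
    best = min(dists, key=dists.get)          # template with minimum distance
    if dists[best] < threshold:
        return best, dists[best]
    return None, dists[best]
```

Returning `None` when the best distance exceeds the threshold corresponds to the claim's fall-through back to step S1 when no photographing-trigger gesture is recognized.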
5. An automatic photographing control system suitable for a photographing cube, the photographing cube comprising a control processor and, respectively in communication with the control processor, a video camera and a photographing camera, characterized in that the automatic photographing control system comprises:
the gesture detection module is used for performing gesture detection on the first N frames of images of the video stream acquired by the camera based on a gesture detection model of the YOLO to obtain a hand area; n is more than or equal to 3;
the hand tracking module is used for modeling the target motion state of the hand area by using Kalman filtering, tracking and predicting the position of the hand area in the next frame in real time and obtaining the gesture of the object to be recognized;
the gesture recognition module is used for comparing the gesture of the object to be recognized with a reference gesture in a preset gesture template library so as to calculate the similarity; when the calculated similarity is larger than a set threshold value, judging the gesture of the object to be recognized as a photographing triggering gesture, and starting photographing countdown; otherwise, returning to the step S1;
the shooting control module is used for controlling the camera to shoot when the shooting countdown is finished so as to obtain an original image;
the image beautification processing module is used for performing face beautification processing on the original image by using the constructed generative adversarial network to obtain a processed target image and outputting the processed target image;
in the image beautification processing module, the generative adversarial network comprises a generator and a discriminator, the generator being implemented as the following formula (2-1),
G(X)=F(X)+X (2-1)
wherein X is the input original image and F(X) is the output obtained after the original image X is processed by a feature learning module F; the feature learning module F mainly adopts a sparsely connected parallel structure to extract features: the input original image X is first reduced to 1/4 of its original size by convolution and pooling, then the resolution is kept unchanged on the main path while the resolution on the branch is 1/2 of that of the main branch so as to learn high-level semantic features, and finally the high-level semantic features are fused with the main branch, which is rich in structural information, and output;
the inputs of the discriminator D are X and G(X); after features are extracted through multilayer convolution, a feature map with a large receptive field and a small resolution is obtained, and the average value of the feature map is used as the score of the discriminator;
in the whole model training process, D adopts binary cross entropy as the loss function; in order to better generate the required target image, the loss target of the generator adds three regularization terms on the basis of the GAN loss: an L1 loss measuring numerical distance, a perceptual loss measuring multi-level structural similarity, and a focal frequency loss measuring frequency-domain similarity, specifically as follows:
the target loss function of the conditional GAN can be expressed as shown in formula (2-2),

$$L_{cGAN}(G,D)=\mathbb{E}_{x,y}\left[\log D(x,y)\right]+\mathbb{E}_{x}\left[\log\left(1-D(x,G(x))\right)\right] \tag{2-2}$$

where G tries to minimize this objective while D tries to maximize it, i.e. $G^{*}=\arg\min_{G}\max_{D}L_{cGAN}(G,D)$; x is the input original image and y is the target image;
in order to ensure the consistency of input and output, an L1 loss is introduced as a constraint, as in formula (2-3),

$$L_{L1}(G)=\mathbb{E}_{x,y}\left[\left\|y-G(x)\right\|_{1}\right] \tag{2-3}$$
in order to ensure that the generated image does not lose structural information, a measure of structural similarity is introduced as in equation (2-4),
$$L_{MS\text{-}SSIM}=1-\mathrm{MS\_SSIM}(G(x),y) \tag{2-4}$$
wherein

$$\mathrm{MS\_SSIM}(G(x),y)=\left[l_{M}(G(x),y)\right]^{\alpha_{M}}\prod_{j=1}^{M}\left[c_{j}(G(x),y)\right]^{\beta_{j}}\left[s_{j}(G(x),y)\right]^{\gamma_{j}} \tag{2-5}$$

the MS_SSIM method takes the reference image and the original image signal as input and iteratively applies a low-pass filter to down-sample the images by a factor of 2; assuming the scale of the original image is 1, the highest scale M is obtained through M-1 iterations; on the jth scale, the contrast comparison and the structural similarity, denoted $c_{j}(G(x),y)$ and $s_{j}(G(x),y)$ respectively, are calculated by the following formulas (2-6) and (2-7):

$$c_{j}(G(x),y)=\frac{2\sigma_{G}\sigma_{y}+C_{2}}{\sigma_{G}^{2}+\sigma_{y}^{2}+C_{2}} \tag{2-6}$$

$$s_{j}(G(x),y)=\frac{\sigma_{Gy}+C_{3}}{\sigma_{G}\sigma_{y}+C_{3}} \tag{2-7}$$
the luminance comparison is calculated only at the scale M, by formula (2-8):

$$l_{M}(G(x),y)=\frac{2\mu_{G}\mu_{y}+C_{1}}{\mu_{G}^{2}+\mu_{y}^{2}+C_{1}} \tag{2-8}$$

wherein $\mu_{G}$ and $\mu_{y}$ represent the means of the generated image and the label image respectively, and $\sigma_{G}$ and $\sigma_{y}$ represent the standard deviations, $\sigma_{Gy}$ being the covariance; $\alpha_{M}$, $\beta_{j}$, $\gamma_{j}$ represent the importance of the luminance estimation, contrast and structural-similarity components on the corresponding scale; to simplify the operation, let $\alpha_{j}=\beta_{j}=\gamma_{j}$; $C_{1}$, $C_{2}$, $C_{3}$ are constants introduced to avoid instability, with $C_{1}=(K_{1}L)^{2}$, $C_{2}=(K_{2}L)^{2}$, $C_{3}=C_{2}/2$, taking the parameters $K_{1}=0.01$, $K_{2}=0.03$, and L being the dynamic range of the pixel values;
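For a single scale, the contrast, structure and luminance comparisons of formulas (2-6) to (2-8) can be sketched with global image statistics (full MS-SSIM additionally uses local windowed statistics and multi-scale down-sampling, which this simplified sketch omits):

```python
import numpy as np

def ssim_components(a, b, C1=(0.01 * 255) ** 2, C2=(0.03 * 255) ** 2):
    """Return (luminance, contrast, structure) comparisons for images a, b,
    computed from global statistics; C1/C2 follow K1=0.01, K2=0.03, L=255."""
    C3 = C2 / 2
    mu_a, mu_b = a.mean(), b.mean()
    sa, sb = a.std(), b.std()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    l = (2 * mu_a * mu_b + C1) / (mu_a ** 2 + mu_b ** 2 + C1)  # eq. (2-8)
    c = (2 * sa * sb + C2) / (sa ** 2 + sb ** 2 + C2)          # eq. (2-6)
    s = (cov + C3) / (sa * sb + C3)                            # eq. (2-7)
    return l, c, s
```

For identical images all three components equal 1, so the MS-SSIM product is 1 and the loss $1-\mathrm{MS\_SSIM}$ vanishes, as expected of formula (2-4).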
in order to further improve the image quality, the gap between the frequency domains is introduced as a constraint term, as in formula (2-9):

$$L_{FFL}=\frac{1}{MN}\sum_{u=0}^{M-1}\sum_{v=0}^{N-1}w(u,v)\left|G(u,v)-y(u,v)\right|^{2} \tag{2-9}$$

wherein $M\times N$ represents the size of the image, $w(u,v)$ is the spectrum weight matrix, $(u,v)$ are the frequency-domain coordinates, and $y(u,v)$ and $G(u,v)$ are the corresponding complex frequency values;
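A sketch of formula (2-9) using a 2-D FFT. The published focal frequency loss takes the normalised magnitude of the spectral difference as the weight matrix $w(u,v)$; the sketch below assumes that choice, which the patent text does not spell out:

```python
import numpy as np

def focal_frequency_loss(gen, ref):
    """Weighted squared distance between the 2-D spectra of the generated
    image `gen` and reference image `ref` (both 2-D float arrays)."""
    G = np.fft.fft2(gen)
    Y = np.fft.fft2(ref)
    diff = np.abs(G - Y) ** 2
    # Assumed focal weighting: emphasise the hardest frequencies, i.e. the
    # normalised magnitude of the spectral difference.
    w = np.sqrt(diff)
    w = w / (w.max() + 1e-12)
    return (w * diff).mean()
```

The weighting pushes the generator to close the largest remaining spectral gaps first, which is the stated purpose of the frequency-domain constraint.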
the overall optimization objective of the generator is therefore the following formula (2-10):

$$L_{total}=L_{cGAN}+\alpha L_{L1}+\beta L_{MS\text{-}SSIM}+\gamma L_{FFL} \tag{2-10}$$

wherein α, β and γ represent the weights of the respective loss terms and are set according to the requirements of actual training, a larger weight of a constraint term meaning a larger influence on the total target loss; the objective function is then continuously and iteratively optimized by the gradient descent method to obtain the optimal solution.
6. The automatic photographing control system for a photographing cube according to claim 5, wherein
in the gesture detection module, the gesture detection model comprises an input unit, an encoder and a decoder, wherein the input unit is used for inputting the first N frames of images of the video stream acquired by the camera, and the encoder is used for performing feature extraction and feature fusion on the first N frames of images to acquire a feature map with richer information; the decoder decodes the feature map into a detection result, thereby obtaining a hand region;
optimizing a loss function L of the following formula (1-1) based on a preset gesture detection training data set, and continuously performing iterative training by a gradient descent method to obtain an optimal detection result:

$$
\begin{aligned}
L ={}& \lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(x_{i}-\hat{x}_{i})^{2}+(y_{i}-\hat{y}_{i})^{2}\right] \\
&+\lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[\left(\sqrt{w_{i}}-\sqrt{\hat{w}_{i}}\right)^{2}+\left(\sqrt{h_{i}}-\sqrt{\hat{h}_{i}}\right)^{2}\right] \\
&+\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left(C_{i}-\hat{C}_{i}\right)^{2}+\lambda_{noobj}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}\left(C_{i}-\hat{C}_{i}\right)^{2} \\
&+\sum_{i=0}^{S^{2}}\mathbb{1}_{i}^{obj}\sum_{c\in classes}\left(p_{i}(c)-\hat{p}_{i}(c)\right)^{2}
\end{aligned}
\tag{1-1}
$$

wherein $S^{2}$ represents the size of the output-layer feature map, namely the number of grids into which the input image is divided; B is the number of Anchors per grid; $\mathbb{1}_{ij}^{obj}$ indicates whether the jth Anchor in the ith grid is responsible for the object prediction, the Anchor with the largest IoU with the real frame being selected as the prediction frame; $\mathbb{1}_{ij}^{noobj}$ indicates that the jth Anchor in the ith grid is not responsible for this object prediction; $x_{i}, y_{i}, w_{i}, h_{i}$ and $\hat{x}_{i}, \hat{y}_{i}, \hat{w}_{i}, \hat{h}_{i}$ respectively represent the centre coordinates and the width and height of each real label and of the prediction frame, all between 0 and 1, with $x_{i}, y_{i}, \hat{x}_{i}, \hat{y}_{i}$ being relative cell offsets; $C_{i}$ and $\hat{C}_{i}$ respectively represent the confidence of the real frame and the confidence of the prediction frame; $p_{i}(c)$ and $\hat{p}_{i}(c)$ represent the true object class probability and the predicted class probability of the corresponding cell; classes represents the number of object categories; $\lambda_{coord}$ and $\lambda_{noobj}$ respectively represent the loss weight of the bounding-box coordinate prediction and the confidence prediction loss weight of bounding boxes without objects;
in the hand tracking module, kalman filtering comprises a prediction process and an update process, wherein the prediction process is to estimate the current state according to the state of the previous moment, and the update process is to correct the prediction information according to an observation value so as to estimate the optimal state; the method specifically comprises the following formulas (1-2) to (1-6):
prediction process:

$$\hat{x}_{t}^{-} = A\hat{x}_{t-1} + Bu_{t} \tag{1-2}$$

$$\Sigma_{t}^{-} = A\Sigma_{t-1}A^{T} + Q \tag{1-3}$$

update process:

$$K_{t} = \Sigma_{t}^{-}H^{T}\left(H\Sigma_{t}^{-}H^{T} + R\right)^{-1} \tag{1-4}$$

$$\hat{x}_{t} = \hat{x}_{t}^{-} + K_{t}\left(z_{t} - H\hat{x}_{t}^{-}\right) \tag{1-5}$$

$$\Sigma_{t} = \left(I - K_{t}H\right)\Sigma_{t}^{-} \tag{1-6}$$

wherein $\hat{x}_{t}^{-}$ represents the prediction estimate derived from the previous state; $\hat{x}_{t}$ represents the optimal estimate obtained by updating $\hat{x}_{t}^{-}$; A represents the state transition matrix; H represents the observation matrix; B represents the input control matrix; $u_{t}$ represents the external control quantity of the system; $z_{t}$ represents the observation at time t; Q and R respectively represent the dynamic noise covariance matrix and the measurement noise covariance matrix; $K_{t}$ represents the Kalman gain at time t; $\Sigma_{t}^{-}$ represents the prediction error covariance matrix; $\Sigma_{t}$ represents the filtering error covariance matrix;
according to the process, initializing the state of the tracker by using a hand region bounding box obtained by detecting the first N frames of images of the video stream acquired by the camera, and then continuously predicting the position of an estimated target in the next frame to realize real-time tracking;
in the gesture recognition module, the picture to be recognized containing the gesture of the object to be recognized and the reference gesture pictures in the gesture template library are first scaled to the same size, then the Euclidean distance is calculated according to the following formula (1-7), and the reference gesture picture with the minimum Euclidean distance is found as the reference gesture picture with the highest similarity:

$$t^{*} = \arg\min_{t \in T}\left\| x - t \right\|_{2} \tag{1-7}$$

wherein $x$ represents the picture to be recognized, $T$ represents the set of reference gesture pictures in the gesture template library, $t$ represents an element of the set $T$, and $t^{*}$ represents the element with the minimum Euclidean distance;
if the Euclidean distance of the found $t^{*}$ is less than the preset threshold value, it is determined that the similarity between $t^{*}$ and the picture to be recognized is greater than the set threshold value, so that the gesture of the object to be recognized is judged to be the photographing-trigger gesture.
7. A photographing cube, characterized in that the photographing cube comprises a control processor and, respectively in communication with the control processor, a video camera and a photographing camera; the video camera is used for acquiring a video stream of an object to be photographed in the photographing cube and transmitting the video stream to the control processor; the control processor is used for controlling the photographing camera to take a photograph according to the received video stream, and for processing the photographed image and then outputting it; wherein the control processor is arranged to be able to execute the automatic photographing control method suitable for a photographing cube according to any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211020415.6A CN115297263B (en) | 2022-08-24 | 2022-08-24 | Automatic photographing control method and system suitable for cube shooting and cube shooting |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115297263A CN115297263A (en) | 2022-11-04 |
CN115297263B true CN115297263B (en) | 2023-04-07 |
Family
ID=83832259
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211020415.6A Active CN115297263B (en) | 2022-08-24 | 2022-08-24 | Automatic photographing control method and system suitable for cube shooting and cube shooting |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115297263B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020164282A1 (en) * | 2019-02-14 | 2020-08-20 | 平安科技(深圳)有限公司 | Yolo-based image target recognition method and apparatus, electronic device, and storage medium |
CN113223059A (en) * | 2021-05-17 | 2021-08-06 | 浙江大学 | Weak and small airspace target detection method based on super-resolution feature enhancement |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102420942A (en) * | 2011-11-28 | 2012-04-18 | 康佳集团股份有限公司 | Photograph device and photograph control method based on same |
CN106454071A (en) * | 2016-09-09 | 2017-02-22 | 捷开通讯(深圳)有限公司 | Terminal and automatic shooting method based on gestures |
CN109815893B (en) * | 2019-01-23 | 2021-03-26 | 中山大学 | Color face image illumination domain normalization method based on cyclic generation countermeasure network |
CN111062312B (en) * | 2019-12-13 | 2023-10-27 | RealMe重庆移动通信有限公司 | Gesture recognition method, gesture control device, medium and terminal equipment |
CN112506342B (en) * | 2020-12-04 | 2022-01-28 | 郑州中业科技股份有限公司 | Man-machine interaction method and system based on dynamic gesture recognition |
CN112837234B (en) * | 2021-01-25 | 2022-07-22 | 重庆师范大学 | Human face image restoration method based on multi-column gating convolution network |
CN113608663B (en) * | 2021-07-12 | 2023-07-25 | 哈尔滨工程大学 | Fingertip tracking method based on deep learning and K-curvature method |
Also Published As
Publication number | Publication date |
---|---|
CN115297263A (en) | 2022-11-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11847826B2 (en) | System and method for providing dominant scene classification by semantic segmentation | |
CN113065558B (en) | Lightweight small target detection method combined with attention mechanism | |
CN107392097B (en) | Three-dimensional human body joint point positioning method of monocular color video | |
CN109815826B (en) | Method and device for generating face attribute model | |
CN108234870B (en) | Image processing method, device, terminal and storage medium | |
CN106780543B (en) | A kind of double frame estimating depths and movement technique based on convolutional neural networks | |
CN110490252B (en) | Indoor people number detection method and system based on deep learning | |
CN110532970B (en) | Age and gender attribute analysis method, system, equipment and medium for 2D images of human faces | |
WO2019210555A1 (en) | People counting method and device based on deep neural network and storage medium | |
CN113286194A (en) | Video processing method and device, electronic equipment and readable storage medium | |
US20110299774A1 (en) | Method and system for detecting and tracking hands in an image | |
WO2022073282A1 (en) | Motion recognition method based on feature interactive learning, and terminal device | |
WO2020171379A1 (en) | Capturing a photo using a mobile device | |
CN111147751B (en) | Photographing mode generation method and device and computer readable storage medium | |
CN114360067A (en) | Dynamic gesture recognition method based on deep learning | |
CN111723707A (en) | Method and device for estimating fixation point based on visual saliency | |
CN112434608A (en) | Human behavior identification method and system based on double-current combined network | |
Küchhold et al. | Scale-adaptive real-time crowd detection and counting for drone images | |
WO2023142912A1 (en) | Method and apparatus for detecting left behind object, and storage medium | |
CN109063549A (en) | High-resolution based on deep neural network is taken photo by plane video moving object detection method | |
CN114708615A (en) | Human body detection method based on image enhancement in low-illumination environment, electronic equipment and storage medium | |
CN113065506A (en) | Human body posture recognition method and system | |
CN115297263B (en) | Automatic photographing control method and system suitable for cube shooting and cube shooting | |
CN115035596B (en) | Behavior detection method and device, electronic equipment and storage medium | |
CN108765384B (en) | Significance detection method for joint manifold sequencing and improved convex hull |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||