CN108307116B - Image shooting method and device, computer equipment and storage medium

Image shooting method and device, computer equipment and storage medium

Info

Publication number
CN108307116B
CN108307116B (application CN201810122474.1A)
Authority
CN
China
Prior art keywords
target
image
posture
target subject
shooting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810122474.1A
Other languages
Chinese (zh)
Other versions
CN108307116A (en)
Inventor
李科慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201810122474.1A
Publication of CN108307116A
Application granted
Publication of CN108307116B
Legal status: Active

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60Control of cameras or camera modules
    • H04N23/61Control of cameras or camera modules based on recognised objects
    • H04N23/611Control of cameras or camera modules based on recognised objects where the recognised objects include parts of the human body
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/80Camera processing pipelines; Components thereof

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to an image shooting method, an image shooting device, computer equipment and a storage medium. According to the method, an image acquired by an image acquisition device is obtained, at least one target subject is identified in the image, the posture change of the at least one target subject is continuously tracked, the posture of the at least one target subject is detected through a trained deep learning neural network model, and a shooting instruction is triggered when the posture of the at least one target subject is detected to match a preset target posture. The method can accurately grasp the time point for snapping a dynamic posture and improves the snapshot effect.

Description

Image shooting method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to an image capturing method and apparatus, a computer device, and a storage medium.
Background
With the development of computer technology, the demand for photography keeps growing: to record the most beautiful moments, people need to press the shutter at exactly the right instant. Many shooting aids have therefore been derived; the shooting modes currently available mainly comprise countdown shooting, Bluetooth-triggered shooting and the like.
These conventional shooting manners are essentially time-controlled. With such methods, the shot is often taken before the photographed person is ready, or the person's action has already ended by the time the shutter fires, so the optimum shooting moment is missed. Existing shooting techniques therefore find it difficult to grasp the time point for snapping a dynamic posture and cannot achieve the best snapshot effect.
Disclosure of Invention
In view of the above, in order to solve the technical problems mentioned, it is desirable to provide an image capturing method, an image capturing apparatus, a computer device and a storage medium that can accurately grasp the time point for snapping a dynamic posture and improve the snapshot effect.
An image capturing method comprising:
acquiring an image acquired by an image acquisition device;
identifying at least one target subject in the image and continuously tracking the posture change of the at least one target subject;
detecting the posture of the at least one target subject through the trained deep learning neural network model;
and triggering a shooting instruction when the posture of the at least one target subject is detected to match a preset target posture.
An image capturing apparatus comprising:
the image acquisition module is used for acquiring an image acquired by the image acquisition device;
the target subject recognition and tracking module is used for recognizing at least one target subject in the image and continuously tracking the posture change of the at least one target subject;
the posture detection module is used for detecting the posture of the at least one target subject through the trained deep learning neural network model;
and the shooting module is used for triggering a shooting instruction when the posture of the at least one target subject matches a preset target posture.
A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of:
acquiring an image acquired by an image acquisition device;
identifying at least one target subject in the image and continuously tracking the posture change of the at least one target subject;
detecting the posture of the at least one target subject through the trained deep learning neural network model;
and triggering a shooting instruction when the posture of the at least one target subject is detected to match a preset target posture.
A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of:
acquiring an image acquired by an image acquisition device;
identifying at least one target subject in the image and continuously tracking the posture change of the at least one target subject;
detecting the posture of the at least one target subject through the trained deep learning neural network model;
and triggering a shooting instruction when the posture of the at least one target subject is detected to match a preset target posture.
According to the image shooting method, the image shooting device, the computer equipment and the storage medium, the posture of the target subject in the scene is monitored through the image acquisition device, and the target subject is continuously tracked, so that the efficiency of detecting the target subject is improved. The posture of the target subject is detected by the trained deep learning neural network, yielding a more accurate dynamic posture. Shooting is then performed according to whether the detected dynamic posture of the target subject matches the preset target posture, so shooting can be triggered at the very moment the dynamic posture is completed, improving the snapshot effect.
Drawings
FIG. 1 is a diagram illustrating an exemplary embodiment of an image capture method;
FIG. 2 is a flow chart illustrating an image capture method according to one embodiment;
FIG. 3 is a schematic flow chart of target subject identification tracking in one embodiment;
FIG. 4 is a schematic flow chart illustrating training of a deep learning neural network according to one embodiment;
FIG. 5 is a schematic diagram of a process for training a deep learning neural network according to another embodiment;
FIG. 6 is a schematic flow chart of gesture detection in one embodiment;
FIG. 7 is a flowchart illustrating a process of performing persistent trigger shooting according to an embodiment;
FIG. 8 is a flow chart illustrating the completion of video capture in another embodiment;
FIG. 9 is a schematic diagram illustrating a multi-target subject triggered shot process in accordance with yet another embodiment;
FIG. 10 is a diagram of a terminal interface for a picture obtained by multi-subject triggered capture in one embodiment;
FIG. 11 is a diagram of a terminal interface for pictures obtained by multi-subject triggered capture in another embodiment;
FIG. 12 is a diagram of a picture terminal interface resulting from multi-subject triggered capture in yet another embodiment;
FIG. 13 is a diagram illustrating exemplary poses at different states during a jump-up in one embodiment;
FIG. 14 is a diagram illustrating a terminal interface of a picture taken when preset state parameters are met in one embodiment;
FIG. 15 is a flow diagram illustrating voice triggered capture in one embodiment;
FIG. 16 is a flowchart illustrating an exemplary embodiment of an image capture method;
FIG. 17 is a block diagram showing the configuration of an image pickup apparatus according to an embodiment;
FIG. 18 is a block diagram of the structure of target subject recognition tracking in one embodiment;
FIG. 19 is a block diagram of a pose detection model in one embodiment;
FIG. 20 is a block diagram showing the structure of a training unit of the pose network model in one embodiment;
FIG. 21 is a block diagram showing the structure of a posture detection model in another embodiment;
FIG. 22 is a block diagram showing the construction of an image pickup apparatus according to another embodiment;
FIG. 23 is a block diagram showing the structure of a video capture module according to an embodiment;
FIG. 24 is a block diagram showing a configuration of a photographing module in one embodiment;
FIG. 25 is a block diagram showing a configuration of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Fig. 1 is a diagram of an application environment for the image capturing method provided in an embodiment; as shown in fig. 1, the environment includes a terminal 110 and a server 120. The terminal 110 includes an image acquisition device for acquiring images. The terminal 110 obtains an image acquired by the image acquisition device, detects a target subject and the posture of the target subject in the image through an image recognition model and a deep learning neural network model, matches the recognized posture of the target subject with a preset posture, and the image acquisition device shoots according to the matching result: when the matching succeeds, the image acquisition device triggers a shooting instruction; when the matching fails, it continues to acquire images. The shot image is sent to the server through a network. Alternatively, the original image collected by the image acquisition device may be sent by the terminal to the server 120, where it is processed to obtain the posture of the target subject in the image, and the posture is returned to the terminal 110. The terminal 110 matches the returned result with the preset target posture; when the matching succeeds, the image acquisition device shoots, and when it fails, the image acquisition device continues to acquire images.
The server 120 may be an independent physical server, a server cluster formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud computing, a cloud database, cloud storage and a CDN. The terminal 110 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a professional camera, and the like. The server 120 and the terminal 110 may be connected through a network or another communication connection manner, which is not limited by the present invention.
In one embodiment, as shown in FIG. 2, an image capture method is provided. The method specifically comprises the following steps:
step S202, acquiring an image acquired by the image acquisition device.
The image acquisition device is a device for acquiring images, such as a camera. A camera generally provides basic functions such as video shooting/transmission and static image capture: after an image is captured through the lens, the photosensitive component circuit and the control component in the camera process it and convert it into a digital signal the computer can recognize, completing the image acquisition. The camera on the shooting equipment can usually be used directly without redevelopment. The image is one or more images acquired by the image acquisition device and may contain a subject, which can be a human, an animal or a scene.
Specifically, an image including a target subject acquired by an image acquisition device is acquired. The image collection composed of a plurality of images continuously collected by the image collection device can be obtained, or the image collection composed of images collected by the image collection device according to a certain time interval can be obtained.
Step S204, at least one target subject is identified in the image, and the posture change of the at least one target subject is continuously tracked.
Here, image recognition is a technique that processes, analyzes and understands an image with a computer in order to recognize targets and objects in the image. Image recognition is based on the main features of the image, which are extracted to identify the target subject; a feature is a set of data used to describe a target subject. For example, a deep learning neural network comprises multiple network layers, each extracting features of different dimensions. Taking a human face as an example, after machine learning the bottom layers of the network extract mainly basic features, such as left oblique lines, right oblique lines, horizontal lines, vertical lines and points; the middle layers extract local features, such as the local features of the facial features; and the top layers describe a whole face through the extracted geometric and positional features of the facial features. The target subject is the behavior subject that triggers a shooting instruction: the shooting instruction is triggered by tracking the posture change of the target subject. The target subject to be recognized may be preset, or may be obtained automatically through a recognition algorithm. The target subject includes a person, an animal, a scene and the like; for example, a target face image may be stored before shooting, so that the corresponding target subject is obtained by identifying that face in the captured image. One or more target subjects may be included in an image. Image tracking locates the target subject detected in the images captured by the image acquisition device to obtain its position information in each image. The target subject may be tracked using tracking algorithms including, but not limited to, a neural network, a particle filter algorithm or a Kalman filter algorithm; any one of these algorithms may be used alone, or several may be combined before tracking. For example, the target subject may be tracked with a particle filter algorithm alone, or with a combination of a particle filter algorithm and a Kalman filter algorithm.
Specifically, main features contained in the image are extracted, the extracted main features are analyzed to identify a target subject, and the posture change of the identified target subject is continuously tracked to obtain the position information of the target subject. The tracking of the target subject includes, but is not limited to, tracking the target subject in each frame or at certain time intervals or at certain frames. The change in the posture of the target subject may be a continuous change in the same posture or a plurality of changes in a plurality of postures.
And step S206, detecting the posture of at least one target subject through the trained deep learning neural network model.
The trained deep learning neural network model is obtained by learning an image data set carrying posture tags. The model can detect the posture of a target subject in an input image containing that subject and output the posture. A posture is a motion made, or a pose struck, by the subject being photographed. By way of example, human actions and poses include, but are not limited to, jumping, pointing to the sky, clapping, waving, turning over, everyone pointing into the distance, throwing a hat, and the like. For animals, the actions and postures include, but are not limited to, jumping, scratching, sticking out the tongue, standing on the hind legs, or lying with all four feet in the air.
Specifically, an image containing a target subject is input into a trained deep learning neural network model, the posture feature of the target subject is extracted, and the posture of the target subject is determined according to the posture feature of the target subject. If the target subject is a human body, inputting an image containing the human body into the trained deep learning neural network model, and outputting to obtain the posture of the human body.
And step S208, when the gesture of at least one target subject is detected to be matched with the preset target gesture, triggering a shooting instruction.
The preset target posture is a posture preset for triggering a shooting instruction; one or more preset target postures can be set at a time. A preset target posture can be determined from postures learned from images by a posture learning algorithm, for example postures learned by a deep learning neural network model, or from a self-defined posture template. Matching means that the posture features of the target subject are the same as, or similar to, the features of a preset target posture: a matching degree between the posture of the target subject and the preset target posture can be calculated against a preset matching-degree threshold, and matching is judged successful when the matching degree reaches that threshold.
Specifically, the posture of the target subject obtained through the trained deep learning neural network model is matched against the preset target posture for triggering shooting; when the two match, the shooting equipment starts and completes shooting. The shot may be a photograph or a video: when the shooting mode of the equipment is photographing, a photo is taken after the shooting instruction is triggered; when the shooting mode is video recording, a video is captured after the shooting instruction is triggered. After one shot is completed, the shooting instruction can be triggered again, and the posture that triggers it again may be the same as or different from the posture that triggered it first. A single shooting session may also set a plurality of target postures for triggering the shooting instruction; the instruction is triggered when the target subject is detected to match at least one of them. A shooting prompt may additionally be issued when shooting is triggered.
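As an illustration of the matching-degree check described above, the following sketch compares a detected posture feature vector against preset target posture features. The cosine-similarity measure, the helper names and the 0.9 threshold are assumptions made for illustration, not the application's prescribed computation.

```python
import numpy as np

def match_degree(pose_feature: np.ndarray, target_feature: np.ndarray) -> float:
    """Cosine similarity between a detected posture feature vector and a
    preset target posture feature vector (an illustrative matching measure)."""
    denom = np.linalg.norm(pose_feature) * np.linalg.norm(target_feature)
    return float(pose_feature @ target_feature / denom) if denom else 0.0

def should_trigger(pose_feature, target_features, threshold=0.9):
    """Trigger when the detected posture matches at least one preset target
    posture, i.e. the matching degree reaches the preset threshold."""
    return any(match_degree(pose_feature, t) >= threshold for t in target_features)
```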
In one embodiment, when there is one target subject in the image, a shooting instruction is triggered when it is detected that the dynamic posture of the target subject matches a posture set in advance for triggering the shooting instruction. If the target subject in the image is recognized as 1 person, and the posture for triggering the shooting instruction is preset as jumping, the shooting instruction is triggered when the target subject in the image performs jumping motion, and the shooting is finished.
In another embodiment, when there are a plurality of target subjects in the image, a shooting instruction is triggered when the posture of any one target subject, or of a preset number of target subjects, is detected to match the posture preset for triggering the shooting instruction. The preset number can be defined as needed: for example, if 10 persons are identified as target subjects in the image, the triggering posture is preset to jumping and the preset number is 3, then the shooting instruction is triggered, and shooting is completed, as soon as 3 persons in the image are detected jumping.
In still another embodiment, when the target subject in the image is plural, the photographing instruction is triggered when it is detected that the postures of all the target subjects in the image match the postures preset for triggering the photographing instruction. If the target subject in the image is recognized as 20 persons and the posture for triggering the shooting command is set as jumping in advance, the shooting command is triggered to complete shooting when the 20 persons are detected to simultaneously make jumping motions in the image.
According to the shooting method, an image acquired by the image acquisition device is obtained, the target subject is identified in the image, and the identified target subject is tracked. Tracking the target subject reduces the image area that must be searched, which shortens detection time and improves detection efficiency. The posture of the target subject is then detected through the trained deep learning neural network model, which can rapidly learn image features and detect the dynamic posture of the target subject. Shooting is performed according to whether the detected dynamic posture matches the preset target posture, so shooting can be triggered at the instant the dynamic posture is completed, improving the snapshot effect.
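The overall flow of steps S202-S208 can be pictured with a minimal sketch; `camera`, `recognizer` and `pose_model` are hypothetical placeholder objects standing in for the modules described above, not names defined by the application.

```python
def process_frame(camera, recognizer, pose_model, target_poses):
    """One pass of the method: identify and track subjects in one acquired
    image, detect their postures, and trigger shooting on a match."""
    frame = camera.capture_frame()                          # step S202
    for subject in recognizer.recognize_and_track(frame):   # step S204
        pose = pose_model.detect_pose(frame, subject)       # step S206
        if pose in target_poses:                            # step S208
            camera.shoot()
            return True      # a shooting instruction was triggered
    return False
```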
As shown in fig. 3, in one embodiment, step S204 includes:
step S204a, the current image is input into the trained image recognition model, and the trained image recognition model obtains the historical position information of at least one target subject in the historical image corresponding to the current image.
The image recognition model is used to recognize a target subject in an image, and recognition yields the subject's position information. The model is typically learned from a large number of tagged photographs and is used to identify, locate and track the target subject in captured pictures. Image tracking analyzes historical images to obtain the historical position information of the target subject and predicts the subject's position in the current image from that history. In particular, the current image may be pre-processed before being input into the trained image recognition model: its size is scaled to match the size of the images on which the model was trained, and its color space is converted as the algorithm requires, since different recognition algorithms expect different color spaces. The preprocessed current image is input into the trained image recognition model, which obtains the historical position information of the target subject in at least one historical image preceding the current image. A historical image is an image one or more frames before the current image, and the historical position information is the target subject's position in that image; for example, the position of the target subject in the previous frame or frames is acquired.
In step S204b, a predicted position area of the at least one target subject in the current image is determined according to the historical position information.
Specifically, the predicted position area is the area where the image recognition model predicts the target subject is likely to appear in the current image; it is predicted from the subject's historical position information in the historical images. Because the time interval between acquired images is small, the distance the target subject can move is limited, so its position area in the current image can be predicted accurately from the historical position information. The prediction may also combine the historical position information with the target subject's motion information. For example, the predicted position area of the target subject in the current frame can be computed from its position in the previous frame or frames according to the Kalman state equation.
In step S204c, when at least one target subject is detected within the predicted position area range, current position information of the at least one target subject in the current image is output.
Specifically, the predicted position region in the current image is searched, and when the target subject is detected there, the detected position information together with the recognized target subject is used as the output of the image recognition model. In view of the performance limitations of mobile terminal devices, having the tracking algorithm search only the area adjacent to the target subject's historical position improves tracking efficiency. Detecting the target subject inside the predicted position area shortens detection time and improves detection efficiency, and the auxiliary positioning provided by image tracking achieves real-time tracking.
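A rough sketch of the predict-then-detect tracking in steps S204a-S204c follows. The constant-velocity extrapolation stands in for the Kalman state equation, and `detector`, the search margin and all names are illustrative assumptions.

```python
def predict_region(history, pad=20):
    """Predict where the target subject will appear in the current frame
    from its last two historical positions (a constant-velocity stand-in
    for the Kalman state equation), expanded by a small search margin."""
    (x0, y0), (x1, y1) = history[-2], history[-1]
    px, py = x1 + (x1 - x0), y1 + (y1 - y0)          # linear extrapolation
    return (px - pad, py - pad, px + pad, py + pad)  # (left, top, right, bottom)

def track(detector, frame, history):
    """Search inside the predicted region first; fall back to a full-image
    search when the subject is not found there (or on the first frames)."""
    if len(history) >= 2:
        subject = detector.detect(frame, region=predict_region(history))
        if subject is not None:
            return subject
    return detector.detect(frame, region=None)       # search the whole image
```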
In one embodiment, after the target subject is identified and located, a part of the frames can be selected to locate the target subject, so that the processing speed of the image is increased.
In one embodiment, since there is no historical position information as reference information when the target subject is detected for the first time, the entire image area is searched to determine the position information of the target subject when the target subject is detected.
In one embodiment, when the target subject is not detected in the predicted position area in the current image, the target subject is detected in the whole current image, the next round of localization and tracking process is performed, and the steps of identifying and tracking the subject in the image are repeated.
As shown in fig. 4, in an embodiment, before step S204, the method further includes:
and S402, inputting the training image set carrying the attitude label into a deep learning neural network model.
Specifically, the pose tag is data for explaining the pose of the target subject in the image. For example, a photo that includes a person jumping up may have a corresponding pose tag of "jumping up". The training image set of pose tags is a set of images that carry various pose tags. The set of training images is input into a deep learning neural network model.
Step S404, state data corresponding to the attitude tag is acquired.
Specifically, the state data is data in a custom format corresponding to the posture tag and may be vector data, matrix data or the like. For example, the state data corresponding to the jump-up posture tag may be (1, 0, 0, 0), to the turn-over tag (0, 1, 0, 0), to the point-to-sky tag (0, 0, 1, 0) and to the rotation tag (0, 0, 0, 1); the terminal acquires the state data corresponding to each posture tag.
And step S406, training the deep learning neural network model by taking the state data as an expected output result of the deep learning neural network model.
Specifically, the state data is used as the expected output of the deep learning neural network model, and the model is trained with this expected output as a guide. If the posture tag of an image is jumping and the corresponding state data is (1, 0, 0, 0, ...), then the expected state data after processing by the model is (1, 0, 0, 0, ...); that is, the state data corresponding to the posture tag is taken as the expected result of the model's learning.
And step S408, updating parameters of the deep learning neural network model to obtain the trained deep learning neural network model.
The deep learning neural network model weights the posture features extracted from the input image to obtain the corresponding output state data. It continuously learns from the labeled input images, learning the features of each posture as completely as possible, representing every posture with the same posture feature set as far as possible, and adjusting the weight of each feature in the set corresponding to each posture during learning. Learning updates the parameters of the model so that the rate of correctly recognized postures over the image set is as high as possible; when this correct recognition rate reaches a preset range, training ends and the trained deep learning neural network model is obtained. The model can rapidly extract image features, which increases image processing speed and reduces time consumption; recognizing the posture of the target subject from the extracted features improves the accuracy of posture detection.
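As a hedged illustration of this training procedure, the following PyTorch-style sketch uses one-hot state data as the expected output, as described above. The network body, feature size and loss choice are assumptions made for illustration, not the application's prescribed architecture.

```python
import torch
import torch.nn as nn

# Illustrative posture classes and their one-hot state data (as in the text).
POSES = ["jump", "turn_over", "point_to_sky", "rotate"]

model = nn.Sequential(              # placeholder for the deep network body
    nn.Linear(9, 32), nn.ReLU(),    # 9 = size of an assumed posture feature set
    nn.Linear(32, len(POSES)),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()              # expected output = one-hot state data

def train_step(features: torch.Tensor, label_index: int) -> float:
    """One parameter update: push the model output toward the state data
    (one-hot vector) that corresponds to the image's posture tag."""
    expected = torch.zeros(len(POSES))
    expected[label_index] = 1.0     # e.g. "jump" -> (1, 0, 0, 0)
    optimizer.zero_grad()
    loss = loss_fn(model(features), expected)
    loss.backward()
    optimizer.step()
    return loss.item()
```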
In one embodiment, multiple different features can be extracted, and feature learning performed, for the photographing-trigger postures of different application scenes, so that the trained network model can recognize the complex postures specific to each scene. For example, a sports-event scene involves complex postures such as dunking, diving, shooting, serving in tennis and aerial gymnastics; when the training data set is large enough and feature extraction is complete enough, the trained network model can accurately identify these complex postures, and fouls in a game can even be judged with the assistance of the recognized postures.
In one embodiment, the deep learning neural network model may comprise a convolutional deep learning neural network model. A weight sharing network structure is generally adopted in the convolution deep learning neural network model, so that the complexity of the network model is reduced, and the number of weights is reduced. Specifically, an image set carrying a posture label is input into a convolution deep learning neural network model, the network model is trained, parameters of the network are updated, and when the parameters of the updated network model can obtain a preset output accuracy rate, the training is stopped, so that a trained network model is obtained.
As shown in fig. 5, in one embodiment, step S408 includes:
step S408a, performing pose feature extraction on each image in the image set to obtain a corresponding pose feature set.
A posture feature is the state of each limb corresponding to an action or pose of the target subject, or a dynamic characteristic of the scenery. For example, the basic movement forms of the human body can be classified mainly into pushing, pulling, whipping, buffering, pushing-stretching, swinging, twisting and opposite movements. The basic movements of the upper limbs can be summarized into 3 types: pushing, pulling and whipping. The basic movements of the lower limbs can be summarized into 3 types: buffering, pushing-stretching and whipping. Whole-body and trunk movements can be divided into 3 types: swinging, twisting and opposite movement. Features are extracted from the image according to these basic movement forms, and the extracted pose feature set comprises upper-limb pushing, upper-limb pulling, upper-limb whipping, lower-limb buffering, lower-limb pushing-stretching, lower-limb whipping, swinging, twisting and opposite movement.
In step S408b, the weight of each pose feature in the pose feature set corresponding to each image is adjusted.
Here the weight is the share each pose feature occupies in each posture. For example, let the pose feature set be the vector x = (upper-limb pushing, upper-limb pulling, upper-limb whipping, lower-limb buffering, lower-limb pushing-stretching, lower-limb whipping, swinging, twisting, opposite movement), and let the supported shooting-trigger postures be (jumping, pointing to the sky, rotating, turning over), whose probabilities are expressed by the vector y = (P_jump, P_point-to-sky, P_rotate, P_turn-over). A matrix W is obtained through machine learning training, whose elements describe the weights corresponding to the pose features in the feature set, so that y = Wx. The ordering of features within the pose feature set may be arbitrary and is not limited to this arrangement.
Step S408c, weighting the corresponding posture features according to the weights of the posture features to obtain current state data.
The current state data is the output of the deep learning neural network model and corresponds to a posture; different state data represent different postures. The pose feature set is weighted to obtain state data, and the posture of the target subject in the image is obtained from that state data. For example, the probability of each posture is calculated from y = Wx; if the result is (0.9, 0, 0, 0.1), then 0.9 is the probability of the jumping posture and 0.1 the probability of the turning-over posture. Since the probability of jumping is much larger than the others, the state data determined from (0.9, 0, 0, 0.1) is (1, 0, 0, 0), and the posture corresponding to state data (1, 0, 0, 0) is jumping.
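The worked example above can be reproduced numerically; every value in x and W below is made up purely for illustration.

```python
import numpy as np

# Pose feature set x for one image (9 basic movement features) and a
# learned weight matrix W; all numeric values are illustrative only.
x = np.array([0.9, 0.1, 0.2, 0.8, 0.9, 0.3, 0.2, 0.1, 0.0])
W = np.array([
    [0.3, 0.0, 0.0, 0.3, 0.4, 0.1, 0.0, 0.0, 0.0],  # jump
    [0.0, 0.4, 0.3, 0.0, 0.0, 0.0, 0.2, 0.1, 0.0],  # point to the sky
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.3, 0.5, 0.2],  # rotate
    [0.1, 0.0, 0.1, 0.1, 0.0, 0.3, 0.2, 0.2, 0.0],  # turn over
])

y = W @ x                                   # per-posture probabilities, y = Wx
state = np.eye(4, dtype=int)[np.argmax(y)]  # one-hot state data, e.g. (1, 0, 0, 0)
print(y, state)                             # the largest entry selects the posture
```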
In step S408d, when the current state data and the expected state data satisfy the convergence condition, the target weight of each corresponding posture feature is obtained.
Specifically, convergence means the error gradually falls within a certain threshold range. The convergence condition may be a threshold on the posture misrecognition rate obtained when the image set carrying posture tags is recognized using the target weights: when the misrecognition rate over that image set is detected to be within the threshold range, the target weight of each corresponding posture feature is obtained. The misrecognition rate is calculated from the current state data and the expected state data: current state data consistent with the expected state data counts as a correct recognition, inconsistent data counts as a misrecognition, and the rate is obtained by counting misrecognitions over the test data. For example, with a misrecognition-rate threshold of 0.15, the deep learning neural network model is trained on the training images and then tested on test data; if the computed misrecognition rate is 0.17, the convergence condition is not yet satisfied and training continues, whereas a misrecognition rate of 0.15 or lower means the convergence condition is satisfied.
Step S408e, obtaining parameters of the deep learning neural network model according to the target weight, and obtaining the trained deep learning neural network model.
Specifically, the target weights are taken as the parameters of the deep learning neural network model, yielding the trained model. An image input into the trained model has its posture features extracted and weighted according to the model parameters to obtain the corresponding state data, which the model outputs. With sufficient training data, the features learned by the trained deep learning neural network model become more accurate, and so do the results of posture detection performed with it.
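A small sketch of the convergence check from step S408d, assuming state data are compared as whole vectors and using the 0.15 threshold from the example above:

```python
def error_recognition_rate(predicted_states, expected_states):
    """Fraction of test images whose output state data differ from the
    expected state data for their posture tag."""
    errors = sum(p != e for p, e in zip(predicted_states, expected_states))
    return errors / len(expected_states)

def converged(predicted_states, expected_states, threshold=0.15):
    """Training may stop once the error recognition rate falls within the
    preset threshold (0.15 in the example above)."""
    return error_recognition_rate(predicted_states, expected_states) <= threshold
```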
As shown in fig. 6, in one embodiment, step S206 includes:
in step S206a, an image region containing at least one target subject is input into the trained deep learning neural network model.
Specifically, the target subject is a target subject recognized from the above-described image recognition model. The image region including the target subject may be an image obtained by dividing the target subject recognized by the image recognition model, or may be an acquired current image including the target subject. The target subject may comprise a person, an animal or a scene, and the image region containing the target subject is input into the trained deep learning neural network model.
Step S206b, performing pose feature extraction on the image region including at least one target subject to obtain a target pose feature set corresponding to the at least one target subject.
The target posture feature set is a feature set formed by a plurality of features of the motion or posture sent by the target subject. Taking the four feet of the animal facing the sky as an example, the posture characteristics obtained by extracting the posture characteristics of the animal are the pushing-up action of the limbs of the animal and the orientation of the limbs.
Specifically, the pose features of an image region containing a target subject are extracted according to a feature extraction algorithm, and the extracted pose features of the target subject are arranged in a certain order or randomly to obtain a corresponding pose feature set, wherein the pose feature set is a vector containing a plurality of pose features.
Step S206c, weighting each posture feature in the posture feature set of at least one target subject according to the weight of each posture feature to obtain corresponding target state data.
Specifically, the weight of each posture feature is a weight corresponding to a parameter of the trained deep learning neural network recognition model, and the target state data is obtained by performing weighting processing on each posture feature in the posture feature set of the target subject through the parameter. The target state data is state data corresponding to the posture of the target subject.
Step S206d, obtaining a target posture of at least one target subject according to the corresponding relationship between the target state data and the posture.
Specifically, there is a correspondence between the state data and the pose, which is defined before the deep learning neural network training described above. Therefore, when the state data of the posture of the target subject in the image is obtained through calculation, the posture of the target subject is determined by searching the corresponding relation between the state data and the posture.
The target posture can be quickly obtained by carrying out posture detection according to the deep learning neural network model, the posture characteristics of the target main body can be learned from multiple dimensions by the deep learning neural network model, more accurate posture characteristics of the target main body can be obtained, and the shooting effect is improved.
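Steps S206a-S206d might be pictured as follows; the state-data-to-posture table mirrors the correspondence defined before training, and `model` with its two methods is a hypothetical placeholder, not an API defined by the application.

```python
# Correspondence between state data and postures, fixed before training
# (tuple keys stand in for the one-hot vectors in the text).
STATE_TO_POSE = {
    (1, 0, 0, 0): "jump",
    (0, 1, 0, 0): "turn_over",
    (0, 0, 1, 0): "point_to_sky",
    (0, 0, 0, 1): "rotate",
}

def detect_pose(model, subject_region):
    """Extract posture features from the image region containing the target
    subject, weight them with the trained model, and map the resulting
    state data back to a posture name."""
    features = model.extract_features(subject_region)  # step S206b
    state = tuple(model.to_state_data(features))       # step S206c
    return STATE_TO_POSE.get(state)                    # step S206d
```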
In one embodiment, the image recognition model and the deep learning neural network model may be combined into one neural network model, and the neural network model may recognize and track a target subject of an input image, detect a pose of the recognized target subject, and determine a pose of the target subject in the image. The neural network model is obtained by learning a target subject and a posture of the target subject included in the plurality of images.
As shown in fig. 7, in an embodiment, after step S208, the method further includes:
and step S602, continuing to acquire the image acquired by the image acquisition device.
Specifically, after a shooting instruction is triggered and shooting is completed, a new image acquired by the image acquisition device is acquired. The new image may be an image acquired after the camera is moved or an image acquired in situ.
Step S604, identifying at least one target subject in the image, continuously tracking the posture change of the at least one target subject, and triggering the shooting instruction again when the posture of the target subject is detected to be matched with the preset target posture.
Specifically, the image acquisition device continues to acquire new images, processes the acquired new images, identifies a target subject in the images through an image identification algorithm, and tracks and positions the identified target subject. The tracking and positioning of the identified target subject includes, but is not limited to, positioning the target subject in a partial image or positioning the target subject in a full image. The pose of the identified target subject is detected by a pose detection model, including but not limited to a trained deep learning neural network model. And when the gesture detected by the gesture detection model is matched with the preset target gesture, shooting again to obtain a new video and a new picture. Wherein the preset target gestures may be one or more. And when the detected posture of the target body is matched with any one preset target posture in the plurality of preset target postures, triggering the shooting instruction again. When a plurality of target subjects are contained in one frame of image, when the gesture of at least one target subject is detected to be matched with the preset target gesture, a shooting instruction is triggered. When the target body posture detection device comprises a plurality of target bodies and a plurality of preset target postures, when the detected posture of at least one target body in the plurality of target bodies is matched with any one preset target posture in the plurality of preset target postures, a shooting instruction is triggered.
And step S606, the step of continuously acquiring the images acquired by the image acquisition device is repeatedly carried out, and the continuous trigger shooting is completed.
Specifically, after re-shooting is completed, the step of detecting acquired images is repeated. The image acquisition device stays in working mode, and the steps of acquiring an image, detecting the posture and triggering the shooting instruction are repeated continuously; repeated shots yield more natural pictures and videos. As long as the image acquisition device faces the target subject, it keeps executing the steps from image acquisition to triggering a shooting instruction and completing the shot. For example, with the device aimed at a child, the preset target postures may include, but are not limited to, clapping, waving, turning over or skipping; when any of these actions is detected, a shooting instruction is triggered, and when three children are in the image, any one of them performing the action triggers the instruction. For a pet, the preset target postures may include, but are not limited to, jumping, scratching, sticking out the tongue, standing on the hind legs or lying with all four feet in the air; when such an action is detected, a shooting instruction is triggered and the shot is completed, producing candid pictures that record the memorable moments of children and pets.
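The repeat loop of steps S602-S606 can be sketched as below, with the same hypothetical placeholder objects as in the earlier flow sketch.

```python
def continuous_capture(camera, recognizer, pose_model, target_poses):
    """Steps S602-S606: the acquisition device stays in working mode, so
    acquiring an image, detecting postures and triggering the shooting
    instruction repeat indefinitely."""
    while True:
        frame = camera.capture_frame()                        # step S602
        for subject in recognizer.recognize_and_track(frame):
            if pose_model.detect_pose(frame, subject) in target_poses:
                camera.shoot()                                # step S604: re-trigger
                break                                         # step S606: repeat
```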
As shown in fig. 8, in one embodiment, the preset target gesture includes a start gesture and an end gesture, and step S208 includes:
and step S208a, when the gesture of at least one target subject is detected to be matched with the initial gesture, triggering a shooting instruction to continuously acquire the pictures shot by the image acquisition device.
Specifically, a preset target posture that triggers the shooting instruction serves as the start posture. When the action made by the target subject identified in the image is detected to be consistent with the preset target posture for triggering the shooting instruction, the shooting instruction is triggered, and the pictures collected by the image acquisition device are continuously obtained.
Step S208b, when it is detected that the pose of the at least one target subject matches the termination pose, causing the image capturing apparatus to stop capturing pictures.
Specifically, a preset target posture used for ending shooting serves as the termination posture. When the action made by the identified target subject is detected to be consistent with the termination posture, the continuous acquisition of pictures from the image acquisition device is stopped. The images acquired continuously between the start posture and the termination posture form a video. Video recording preserves more information and can record the entire course of multiple dynamic postures.
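A minimal sketch of recording between the start and termination postures, assuming a placeholder `pose_of` detector and illustrative posture names:

```python
def record_between_gestures(camera, pose_of, start_pose="jump", end_pose="wave"):
    """Steps S208a-b: start collecting frames when the start posture is
    detected and stop on the termination posture; the frames in between
    form the recorded video."""
    recording, video = False, []
    while True:
        frame = camera.capture_frame()
        pose = pose_of(frame)
        if not recording and pose == start_pose:
            recording = True            # trigger the shooting instruction
        elif recording and pose == end_pose:
            return video                # stop acquiring pictures
        if recording:
            video.append(frame)
```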
In one embodiment, the preset target gesture includes a plurality of preset target sub-gestures, and step S208 includes:
when the gesture of at least one target subject is detected to be matched with the preset target gesture, a shooting instruction is triggered, and a plurality of sub-gesture pictures containing the same preset target gesture are obtained through shooting.
Here, a preset target posture comprising a plurality of preset target sub-postures means that, over the whole time period from start to finish of the motion, the motion contains several sub-postures that each satisfy the preset posture; in chronological order, these sub-postures record how the posture of the photographed subject changes over the whole period.
Specifically, a plurality of pictures can be taken in succession when photographing is triggered. Each action lasts for a period of time from start to finish, during which several pictures can be taken continuously to record the whole course of the action. For example, in a series of jump pictures, the person's height rises from take-off and falls until landing; when the target subject's posture is detected as jumping, shooting starts and several pictures are taken in succession. Shooting can proceed at preset time intervals, and after shooting is triggered the image acquisition device can keep and store the most recent frames. The number of frames saved can be set by the user, for example 3 or 5 consecutive pictures.
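The recent-frame buffering described here might look like the following sketch; the default of 5 saved frames is simply the user-set example above.

```python
from collections import deque

class BurstBuffer:
    """Keeps the most recent frames so that, when shooting is triggered,
    a user-set number of pictures (e.g. 3 or 5) spanning the action can
    be saved."""
    def __init__(self, frames_to_save: int = 5):
        self.frames = deque(maxlen=frames_to_save)

    def push(self, frame):
        self.frames.append(frame)     # continuously buffer acquired frames

    def snapshot(self):
        return list(self.frames)      # saved when a shooting trigger fires
```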
In one embodiment, the preset target gesture set is composed of a plurality of preset target gestures, and step S208 includes: and triggering shooting when the detected gesture of at least one target subject is matched with any one target gesture in the preset target gesture set.
Specifically, the preset target posture set is composed of a plurality of preset target postures, and the preset target postures can be determined according to the expected shooting content of the photographer. When shooting is carried out, when the fact that the posture of the target body is matched with any one preset target posture in the preset target postures is detected, a shooting instruction is triggered. For example, the preset target postures include jumping, pointing to the sky, turning over and the like, when the jumping posture of the target main body is detected, pictures are shot, or when the pointing to the sky action of the target main body is detected, a shooting instruction is triggered. And when the gesture made by the target body is matched with any one preset target gesture in the plurality of preset target gestures, triggering a photographing instruction to finish photographing.
In one embodiment, the preset target postures comprise a plurality of postures, and the shooting device can trigger shooting multiple times against them: the shooting instruction is triggered the first time the target subject's posture is detected to match any one of the preset target postures, completing the first shot; the image acquisition device then acquires images again, the newly acquired images are detected, and when the target subject again matches any one of the preset target postures, the shooting instruction is triggered again and another shot is completed; the device continues acquiring images, and the steps of detecting, triggering and shooting are repeated. For example, when photographing a basketball game, the preset target postures may include, but are not limited to, shooting the ball, dribbling, passing and catching the ball. When the shot attempt in an image matches the shot-attempt action among the preset target postures, a shooting instruction is triggered and the picture containing that action is saved; pictures then continue to be acquired and the target subject's posture is detected again, so that when a shot attempt is next detected, the current photograph is completed and a picture containing the shot-attempt posture is obtained, and when dribbling is detected, a photograph containing the dribbling posture is obtained. Acquiring pictures, checking whether the posture matches a preset target posture, and photographing on a match are repeated in this way. This shooting mode of triggering the instruction multiple times against multiple preset target postures yields more, and more natural, images.
As shown in fig. 9, in one embodiment, step S208 includes:
in step S208c, when it is detected that a plurality of target subjects are included in the target subject, postures of the plurality of target subjects are detected.
Specifically, the same frame image contains a plurality of target subjects, whose types may be the same or different depending on the application scene. The postures of the several recognized target subjects are detected: either each subject's individual posture is detected, or a posture made jointly by all the subjects is detected.
In step S208d, when it is detected that the postures of the plurality of target subjects simultaneously match the preset target postures, a shooting instruction is triggered.
Specifically, when the condition for triggering the shooting instruction is that a plurality of target subjects satisfy it simultaneously, the instruction is triggered when the detected postures of all target subjects satisfy the preset target posture. As shown in fig. 10, the horizontal line at the bottom of fig. 10 represents the ground; when all 4 persons in the figure are recognized to have left the ground, the photograph is taken, capturing a multi-person jump photo. Alternatively, a posture made jointly by several target subjects can be detected, and the shooting instruction is triggered when that joint posture is consistent with the preset target posture, for example when shooting an acrobatic performance completed by several people, or a team event in competitive sports. Detecting the postures of the target subjects and triggering shooting from those postures makes shooting more convenient: every target subject can be in the picture, and no extra photographer is needed.
In this embodiment, shooting is triggered when all target subjects hold the same motion. As shown in fig. 11, the shooting instruction is triggered when every person shows a smiling face, that is, a shot is taken when all 4 detected target subjects smile simultaneously. As shown in fig. 12, when all 5 persons in the figure strike the pose named "this is a goal", shooting is triggered and a picture containing that pose is obtained.
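The three multi-subject trigger rules described in this and the preceding embodiments (any one subject, a preset number of subjects, or all subjects) can be condensed into one illustrative check; the mode names and default count are assumptions.

```python
def multi_subject_trigger(poses, target_pose, mode="all", preset_count=3):
    """Trigger rules for several subjects in one frame: any one subject,
    a preset number of subjects, or all subjects (e.g. all 4 people
    jumping, or all 5 making the same group pose)."""
    matches = sum(pose == target_pose for pose in poses)
    if mode == "any":
        return matches >= 1
    if mode == "count":
        return matches >= preset_count
    return matches == len(poses)       # mode == "all"
```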
In one embodiment, step S206 includes: detecting the posture of at least one target subject through the trained deep learning neural network model to obtain the posture of the at least one target subject and a state parameter corresponding to that posture, wherein a change of the state parameter reflects a state change of the corresponding posture of the target subject.
The state parameter represents the current degree of the corresponding posture of the target subject; since this degree changes as the posture changes dynamically, the state parameter detected by the deep learning neural network model also changes dynamically. Specifically, the target subject identified by the deep learning neural network model is detected to obtain the posture corresponding to the target subject and the state parameter corresponding to that posture. The whole continuous process from the formation of a posture to its termination comprises a plurality of state parameters, each representing a different state of the posture of the target subject.
In one embodiment, the type of the state parameter corresponds to the posture type: when the posture type is jumping, the state parameter is the jump height; when the posture type is smiling, the state parameter is the degree of the smile, and so on.
In another embodiment, the state parameter may be represented, without limitation, numerically or by a level. For example, when a photograph of a jumping person is taken, the jump height of the person in the image is graded into levels, each level corresponding to a range of height values. For instance, the jump height may be classified into, but not limited to, 3 levels: a jump height within a first height threshold is level 1, a jump height between the first and a second height threshold is level 2, and a jump height beyond the second height threshold is level 3.
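The three-level grading can be sketched as follows; the two threshold values are illustrative assumptions, not values given by the patent:

```python
# Sketch of grading a jump height into 3 levels; thresholds are invented
# for illustration only.

FIRST_THRESHOLD = 0.3   # metres, hypothetical upper bound of level 1
SECOND_THRESHOLD = 0.6  # metres, hypothetical upper bound of level 2

def jump_height_level(height_m):
    """Map a jump height value to level 1, 2 or 3."""
    if height_m <= FIRST_THRESHOLD:
        return 1
    if height_m <= SECOND_THRESHOLD:
        return 2
    return 3

assert jump_height_level(0.2) == 1
assert jump_height_level(0.5) == 2
assert jump_height_level(0.9) == 3
```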
Step S208 includes: when the posture of at least one target subject matches the preset target posture and the state parameter of the posture of the matched target subject satisfies the preset state parameter, triggering a shooting instruction.
Specifically, shooting is triggered when it is detected that the posture of at least one of the target subjects matches a preset target posture and the state parameter of that matched posture satisfies the preset state parameter. Controlling shooting through both the state parameter and the posture allows the desired shooting posture to be captured more accurately.
In one embodiment, the state parameter corresponding to the posture of the target subject matched with the preset target posture satisfies the preset state parameter when the state parameter corresponding to the posture of the target subject is greater than or equal to the preset state parameter.
Specifically, if the preset state parameter is a jump height threshold, then when the jump height of the posture of the target subject is greater than or equal to that threshold, the state parameter corresponding to the matched posture satisfies the preset state parameter. Since the process from jumping up to falling back is one of continuously changing height, a set of images of the continuously changing posture can be triggered through the preset state parameters.
In one embodiment, the preset state parameter is a preset expression change coefficient range, and the state parameter corresponding to the posture of the target subject matched with the preset target posture satisfies the preset state parameter when the expression change coefficient of that posture is within the preset range. Specifically, suppose the preset expression change coefficient range is [0.5, 0.8], representing the amplitude of the expression change, where 0.5 represents a slight smile and 0.8 represents a broad smile. If the expression change coefficient corresponding to the posture of the matched target subject is 0.6, the expression is within the preset range, and the state parameter corresponding to the posture is judged to satisfy the preset state parameter.
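The range check itself is straightforward; a sketch using the [0.5, 0.8] range from the example above:

```python
# Sketch of the expression-change-coefficient check: the state parameter
# satisfies the preset when it falls inside the configured range. The
# interpretation of 0.5 and 0.8 follows the example in the text.

EXPRESSION_RANGE = (0.5, 0.8)

def expression_in_range(coefficient, rng=EXPRESSION_RANGE):
    low, high = rng
    return low <= coefficient <= high

assert expression_in_range(0.6)      # within range -> preset satisfied
assert not expression_in_range(0.9)  # outside range -> no trigger
```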
In a specific embodiment, the state parameter may be set as the jump height. As shown in fig. 13, from left to right the figure shows different states of the jump posture of the same person in the process from jumping up to falling back to the ground, each state corresponding to a different state parameter, that is, a different jump height. The states are denoted by reference numerals 001, 002, 003, 004 and 005. In the 2 states corresponding to numerals 001 and 005, the jump posture of the person is detected but the jump height does not satisfy the preset jump height threshold, so the shooting instruction is not triggered. In the 3 states corresponding to numerals 002, 003 and 004, the jump posture is detected and the jump height reaches the preset threshold; since the posture and the state parameter satisfy the preset conditions simultaneously, the shooting instruction is triggered, yielding the shot images shown in fig. 14: image 010, image 020 and image 030 correspond to the states 002, 003 and 004 respectively, and together they form the shot image set.
As shown in fig. 15, in one embodiment, step S208 includes:
step S208e, voice data of at least one target subject is acquired.
Specifically, the voice data is sound uttered by the photographed subject and acquired by a voice acquisition means. The voice data may carry information corresponding to a posture, or may be other specified voice data; it contains text information matched with the posture of the target subject, for example voice data meaning, without limitation, "come on", "raise hands", "jump up", and the like.
Step S208f, performing voice detection and recognition on the voice data to obtain a corresponding voice recognition result.
Specifically, the acquired voice data is detected and recognized by the voice recognition device. The detection and recognition modes include, but are not limited to, extracting the text information in the voice data as the voice recognition result, or extracting the time-domain and frequency-domain signals of the voice data to obtain the corresponding voice recognition result. For example, the text information obtained by detecting the voice data is "pointing to the sky", "come on", "raise hands", "jump up", etc., or the time-domain or frequency-domain waveform obtained by processing the voice data is similar or identical to that of preset voice data.
In step S208g, when it is detected that the gesture of the at least one target subject matches the preset target gesture and the voice recognition result matches the preset target voice data, a shooting instruction is triggered.
Specifically, when it is detected that the posture of at least one target subject among the plurality of target subjects matches a preset target posture and the voice recognition result matches the preset target voice data for triggering a shooting instruction, the shooting instruction is triggered. When the voice recognition result is text information, the text information includes, but is not limited to, text corresponding to the posture, determined through a preset correspondence between postures and text information; whether the voice recognition result matches the text information in the preset voice data is judged, and the shooting instruction is triggered when the match succeeds. When the voice recognition result is a time-domain or frequency-domain signal of the voice data and it matches the time-domain or frequency-domain signal of the preset target voice data, the shooting instruction is triggered. Controlling shooting through voice data and posture simultaneously can capture the intended shooting posture more accurately: when either the voice data or the posture fails its preset condition, the shooting instruction is not triggered, reducing shooting errors. For example, if the target subject is detected jumping up but the text recognized from its voice data is "pointing to the sky", the shooting instruction is not triggered; when the recognized text is "jump up", the shooting instruction is triggered and the shot is completed.
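A sketch of the combined condition, with a hypothetical posture-to-phrase mapping (the patent only requires that the recognized text match preset target voice data):

```python
# Sketch: both the detected posture and the speech-recognition result must
# match before the shooting instruction fires. The mapping below is an
# illustrative assumption.

POSE_TO_PHRASE = {
    "jump": "jump up",
    "point_to_sky": "pointing to the sky",
}

def should_shoot(pose, recognized_text, preset_pose="jump"):
    """Trigger only when the posture matches the preset pose AND the
    recognized speech matches the phrase associated with that pose."""
    return pose == preset_pose and recognized_text == POSE_TO_PHRASE.get(preset_pose)

assert should_shoot("jump", "jump up")                  # both match -> shoot
assert not should_shoot("jump", "pointing to the sky")  # speech mismatch -> no shot
```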
As shown in fig. 16, in a specific embodiment, the photographing method includes:
step S802, acquiring an image acquired by an image acquisition device.
Step S804, performing target recognition and tracking on the image through the trained image recognition model, which is obtained by learning images carrying target subject labels. The position of the target subject in the current image is predicted from the acquired position information of the target subject in historical images to obtain a predicted position area; the target subject is then detected within that area and identified.
Step S806, inputting an image area containing at least one target subject into the trained deep learning neural network model, extracting the attitude features of the image area containing the target subject according to a feature extraction algorithm, and weighting each attitude feature according to the weight corresponding to each attitude feature in the parameters of the trained deep learning neural network model to obtain corresponding target state data. And searching a target attitude corresponding to the target state data according to the corresponding relation between the state data and the attitude.
Step S808, matching the target posture against the preset target postures, and executing step S810 when the match succeeds. There may be one or more preset target postures, for example jumping, pointing to the sky and turning over. When the target subject is detected making the pointing-to-the-sky motion, step S810 is performed. When the posture of the target subject does not match any preset target posture, the flow returns to step S802; that is, if the target subject makes none of the three preset target postures, steps S802 to S808 are repeated.
In step S810, it is detected whether the shooting device is set to continuous shooting. If continuous shooting is set, step S812 is executed; if not, step S822 is executed.
Step S812, triggering the photographing: a plurality of photos may be taken continuously or only one, for example, without limitation, a single shot, a burst of 3 or a burst of 5. After shooting, step S830 is executed.
In step S822, shooting of a video is started.
In step S824, it is detected whether a termination gesture for ending the video recording has been set. If a termination gesture is set, step S826a is executed; if not, step S826b is executed.
In step S826a, it is detected whether the posture of the target subject in the images captured by the image acquisition device matches the termination posture among the preset target postures. If the termination gesture is turning over, step S828 is executed when the turning-over motion of the target subject is detected in the video image.
In step S826b, the duration of the video capture is acquired, and it is detected whether it has reached the preset shooting duration; when it has, step S828 is performed. For example, if the video duration is set to 4 minutes, step S828 is performed when the recorded duration reaches 4 minutes.
In step S828, when the condition of step S826a or step S826b is satisfied, that is, when the target subject in the video image turns over or the recording duration reaches 4 minutes, the video capture is stopped. After capture stops, step S830 is performed.
In step S830, the photographs taken in step S812 or the video obtained in step S828 are saved.
Step S832, when step S830 is completed, it is detected whether the shooting device has a multi-shot function; if so, the flow returns to step S802 and steps S802 to S832 are repeated; if not, the flow proceeds to step S834.
In step S834, the shooting is ended.
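The flow of fig. 16 can be condensed into the following sketch; the pose names and the 4-minute limit follow the examples above, while the helper functions and everything else are assumptions made for illustration:

```python
# Compact sketch of the fig. 16 flow. capture_frame(), detect_pose() and
# save() are hypothetical placeholders.

import time

PRESET_POSES = {"jump", "point_to_sky", "turn_over"}
TERMINATION_POSE = "turn_over"
MAX_VIDEO_SECONDS = 4 * 60          # preset shooting duration from the example

def shooting_flow(capture_frame, detect_pose, save, video_mode=False):
    while True:                     # S802-S808: acquire until a pose matches
        frame = capture_frame()
        if detect_pose(frame) in PRESET_POSES:
            break
    if not video_mode:              # S812: take a photo (or a burst) and stop
        save(frame)
        return
    start = time.monotonic()        # S822: start recording video
    clip = [frame]
    while True:
        frame = capture_frame()
        clip.append(frame)
        if detect_pose(frame) == TERMINATION_POSE:
            break                   # S826a: termination gesture detected
        if time.monotonic() - start >= MAX_VIDEO_SECONDS:
            break                   # S826b: preset duration reached
    save(clip)                      # S828/S830: stop shooting and save
```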
As shown in fig. 17, in one embodiment, there is provided an image photographing device 200 including:
the image acquisition module 202 is configured to acquire an image acquired by the image acquisition device.
And the target subject recognizing and tracking module 204 is used for recognizing at least one target subject in the image and continuously tracking the posture change of the at least one target subject.
The gesture detection module 206 is used for detecting the posture of at least one target subject through the trained deep learning neural network model.
The shooting module 208 is configured to trigger a shooting instruction when the posture of the at least one target subject is detected to match the preset target posture.
As shown in fig. 18, in one embodiment, the target subject identification tracking module 204 includes:
the historical position obtaining unit 204a is configured to input the current image into a trained image recognition model, and the trained image recognition model obtains historical position information of at least one target subject in a historical image corresponding to the current image.
And the prediction unit 204b is used for determining a prediction position area of the at least one target subject in the current image according to the historical position information.
A current position output unit 204c, configured to output current position information of the at least one target subject in the current image when the at least one target subject is detected within the predicted position area range.
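The prediction step of units 204a to 204c can be sketched with a simple constant-velocity assumption (the patent does not prescribe a particular motion model):

```python
# Sketch: extrapolate the subject's next position from its last two
# historical positions, then accept a detection only if it falls inside
# the predicted position area. The margin and motion model are assumptions.

def predict_region(prev_xy, last_xy, margin=50):
    """Return a square search region around the extrapolated next centre."""
    vx, vy = last_xy[0] - prev_xy[0], last_xy[1] - prev_xy[1]
    cx, cy = last_xy[0] + vx, last_xy[1] + vy
    return (cx - margin, cy - margin, cx + margin, cy + margin)

def detection_in_region(det_xy, region):
    x0, y0, x1, y1 = region
    return x0 <= det_xy[0] <= x1 and y0 <= det_xy[1] <= y1

region = predict_region((100, 100), (120, 110))   # two historical positions
assert detection_in_region((138, 118), region)    # output current position
```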
As shown in fig. 19, in one embodiment, the image photographing apparatus 200 includes:
an image data input unit 402, configured to input the training image set carrying the pose label into the deep learning neural network model.
A state data acquisition unit 404, for acquiring state data corresponding to the posture label.
A network model training unit 406, configured to train the deep learning neural network model by taking the state data as its expected output result, and to update the parameters of the deep learning neural network model to obtain the trained deep learning neural network model.
As shown in fig. 20, in one embodiment, the network model training unit 406 includes:
A feature extraction subunit 406a, configured to perform posture feature extraction on each image in the training image set to obtain a corresponding posture feature set.
A weight adjusting subunit 406b, configured to adjust the weight of each posture feature in the posture feature set corresponding to each image.
A current state data calculating subunit 406c, configured to obtain current state data after weighting the corresponding posture features according to their weights.
A target weight calculating subunit 406d, configured to obtain the target weight of each corresponding posture feature when the current state data and the expected state data satisfy the convergence condition.
A network model determining subunit 406e, configured to obtain the parameters of the deep learning neural network model according to the target weights, thereby obtaining the trained deep learning neural network model.
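A sketch of this training loop, reading the subunits as a simple iterative weight adjustment; the gradient-style update rule is an assumption, since the patent only states that weights are adjusted until the current and expected state data satisfy a convergence condition:

```python
# Sketch of training pose-feature weights against expected state data.
# The update rule and hyperparameters are illustrative assumptions.

def train_weights(feature_sets, expected, lr=0.01, tol=1e-3, epochs=1000):
    """feature_sets: one pose-feature vector per training image;
    expected: the expected state data for each image."""
    weights = [0.0] * len(feature_sets[0])
    for _ in range(epochs):
        total_err = 0.0
        for feats, target in zip(feature_sets, expected):
            current = sum(w * f for w, f in zip(weights, feats))          # 406c
            err = target - current
            total_err += abs(err)
            weights = [w + lr * err * f for w, f in zip(weights, feats)]  # 406b
        if total_err < tol:      # 406d: convergence condition satisfied
            break
    return weights               # 406e: parameters of the trained model
```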
As shown in fig. 21, in one embodiment, the gesture detection module 206 includes:
an image input unit 206a, configured to input an image region including at least one target subject into the trained deep learning neural network model.
The target pose feature set extracting unit 206b is configured to perform pose feature extraction on an image region including at least one target subject to obtain a target pose feature set corresponding to the at least one target subject.
A target state data calculating unit 206c, configured to weight each posture feature in the posture feature set of the at least one target subject according to its weight to obtain corresponding target state data.
A target posture searching unit 206d, configured to obtain the target posture of the at least one target subject according to the corresponding relation between the target state data and postures.
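Units 206a to 206d can be sketched as a weighted sum followed by a lookup; representing the "corresponding relation between state data and postures" as a nearest-value table is an assumption made for illustration:

```python
# Sketch: weight the extracted pose features into target state data, then
# find the posture whose reference state data is nearest. The table values
# are invented for illustration.

STATE_TO_POSE = {1.0: "stand", 2.0: "jump", 3.0: "turn_over"}

def detect_posture(features, weights):
    state = sum(w * f for w, f in zip(weights, features))  # target state data
    nearest = min(STATE_TO_POSE, key=lambda s: abs(s - state))
    return STATE_TO_POSE[nearest]

assert detect_posture([0.5, 1.0], [2.0, 1.1]) == "jump"   # state = 2.1
```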
As shown in fig. 22, in one embodiment, the image capturing apparatus 200 further includes:
the image acquisition module 202 is further configured to continue to acquire the image acquired by the image acquisition device.
The flow then re-enters the target subject recognition and tracking module 204 to recognize the target subject in the image and continuously track the posture change of the at least one target subject; the posture detection module 206 detects the posture of the recognized at least one target subject; and when the detected posture matches the preset target posture, the shooting module 208 triggers the shooting instruction and completes the shot.
The flow repeatedly re-enters the image acquisition module 202, completing continuously triggered shooting.
As shown in fig. 23, in one embodiment, the photographing module 208 includes:
A continuous shooting unit 208a, used for triggering a shooting instruction to continuously acquire the pictures shot by the image acquisition device when the posture of the at least one target subject is detected to match the initial posture.
A stop shooting unit 208b, used for stopping the image acquisition device from shooting pictures when the posture of the at least one target subject is detected to match the termination posture.
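Units 208a and 208b together bound a recording interval by gestures; a sketch under the same hypothetical helpers as before:

```python
# Sketch: start collecting frames on the initial gesture, stop on the
# termination gesture; pose names are illustrative.

def record_between_gestures(capture_frame, detect_pose,
                            start_pose="jump", end_pose="turn_over"):
    frames, recording = [], False
    while True:
        frame = capture_frame()
        pose = detect_pose(frame)
        if not recording and pose == start_pose:
            recording = True              # 208a: initial gesture matched
        if recording:
            frames.append(frame)
            if pose == end_pose:          # 208b: termination gesture matched
                return frames
```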
In one embodiment, the shooting module 208 is further configured to trigger a shooting instruction to shoot a sub-pose photo corresponding to the plurality of preset target sub-poses when it is detected that the pose of the at least one target subject matches the preset target sub-pose.
In one embodiment, the capturing module 208 is further configured to trigger the capturing instruction when the gesture of the at least one target subject is detected to match any one of a set of preset target gestures.
In one embodiment, the photographing module 208 is further configured to detect postures of a plurality of target subjects when the target subjects are detected to include the plurality of target subjects, and trigger a photographing instruction when the postures of the plurality of target subjects are detected to simultaneously match a preset target posture.
In one embodiment, the posture detection module 206 is further configured to detect the posture of the at least one target subject through the deep learning neural network model, obtaining the posture of the at least one target subject and a state parameter corresponding to that posture, wherein a change of the state parameter reflects a state change of the corresponding posture of the target subject.
In this embodiment, the shooting module 208 is further configured to trigger a shooting instruction when it is detected that the posture of at least one target subject matches the preset target posture and the state parameter of the posture of the target subject matching the preset target posture meets the preset state parameter.
As shown in fig. 24, in one embodiment, the photographing module 208 includes:
a voice acquiring unit 208e, configured to acquire voice data of at least one target subject.
The voice recognition unit 208f is configured to perform detection and recognition on the voice data to obtain a corresponding voice recognition result.
A voice gesture shooting unit 208g, configured to trigger a shooting instruction when the posture of the at least one target subject is detected to match the preset target posture and the voice recognition result matches the preset target voice data.
Fig. 25 shows the internal structure of a computer device in one embodiment. The computer device comprises a processor, a non-volatile storage medium, an internal memory and a network interface connected through a system bus. The non-volatile storage medium of the computer device may store an operating system and computer-readable instructions which, when executed, cause the processor to perform an image shooting method. The processor of the computer device provides computing and control capability and supports the operation of the whole device. The internal memory may store computer-readable instructions which, when executed by the processor, cause the processor to perform the image shooting method. The network interface of the computer device is used for network communication, such as receiving images and sending stop control instructions. Those skilled in the art will appreciate that the architecture shown in fig. 25 is merely a block diagram of part of the structure related to the present solution and does not limit the computer devices to which the solution applies; a particular computer device may include more or fewer components than shown, combine certain components, or arrange components differently.
In one embodiment, the shooting apparatus provided by the present application may be implemented as a computer program that can run on the computer device shown in fig. 25; the non-volatile storage medium of the computer device may store the program modules constituting the shooting apparatus, such as the image acquisition module 202 of fig. 17. The program modules contain computer-readable instructions for causing the computer device to execute the steps of the shooting methods of the embodiments described in this specification. For example, the computer device may acquire the image collected by the image acquisition device through the image acquisition module 202 shown in fig. 17; recognize at least one target subject in the image and continuously track its posture change through the target subject recognition and tracking module 204; detect the posture of the at least one target subject through the trained deep learning neural network model via the posture detection module 206; and, through the shooting module 208, trigger the shooting instruction and complete shooting when the posture of the at least one target subject is detected to match the preset target posture.
In one embodiment, a computer readable storage medium is provided, storing a computer program that, when executed by a processor, causes the processor to perform the steps of: the method comprises the steps of obtaining an image collected by an image collecting device, identifying at least one target main body in the image, continuously tracking the posture change of the at least one target main body, detecting the posture of the at least one target main body through a trained deep learning neural network model, and triggering a shooting instruction when the posture of the at least one target main body is matched with a preset target posture.
In one embodiment, identifying at least one target subject in the image and continuously tracking a change in pose of the at least one target subject includes: inputting a current image into a trained image recognition model, acquiring historical position information of at least one target subject in a historical image corresponding to the current image by the trained image recognition model, determining a predicted position area of the at least one target subject in the current image according to the historical position information, and outputting the current position information of the at least one target subject in the current image when the at least one target subject is detected in the predicted position area.
In one embodiment, prior to detecting the pose of the at least one target subject by the trained deep learning neural network model, the computer program further causes the processor to perform the steps of: inputting the training image set carrying the attitude label into the deep learning neural network model, acquiring state data corresponding to the attitude label, taking the state data as an expected output result of the deep learning neural network model, training the deep learning neural network model, updating parameters of the deep learning neural network model, and obtaining the trained deep learning neural network model.
In one embodiment, updating parameters of the deep learning neural network model to obtain a trained deep learning neural network model includes: extracting the attitude features of each image in the image set to obtain a corresponding attitude feature set, adjusting the weight of each attitude feature in the attitude feature set corresponding to each image, weighting the corresponding attitude feature according to the weight of each attitude feature to obtain current state data, obtaining the target weight of each corresponding attitude feature when the current state data and the expected state data meet the convergence condition, obtaining the parameters of the deep learning neural network model according to the target weight, and obtaining the trained deep learning neural network model.
In one embodiment, detecting the pose of at least one target subject through a trained deep learning neural network model comprises: inputting an image area containing at least one target subject into a trained deep learning neural network model, extracting attitude features of the image area containing at least one target subject to obtain a target attitude feature set corresponding to at least one target subject, weighting each attitude feature in the attitude feature set of at least one target subject according to the weight of each attitude feature to obtain corresponding target state data, and obtaining the target attitude of at least one target subject according to the corresponding relation between the target state data and the attitude.
In one embodiment, after triggering the shooting instruction, the computer program further causes the processor to perform the steps of: continuously acquiring images acquired by the image acquisition device; entering the step of identifying at least one target subject in the image and continuously tracking the posture change of the at least one target subject; triggering the shooting instruction again when the posture of the at least one target subject is detected to match the preset target posture; and repeatedly entering the step of continuously acquiring images acquired by the image acquisition device, so as to complete continuously triggered shooting.
In one embodiment, the preset target gesture includes a start gesture and an end gesture, and when it is detected that the gesture of the at least one target subject matches the preset target gesture, the shooting instruction is triggered, including: when the gesture of the at least one target subject is detected to be matched with the initial gesture, triggering a shooting instruction, continuously acquiring the picture shot by the image acquisition device, and when the gesture of the at least one target subject is detected to be matched with the termination gesture, stopping shooting the picture by the image acquisition device.
In one embodiment, the preset target gesture includes a plurality of preset target sub-gestures, and triggering the shooting instruction when the gesture of the at least one target subject is detected to match the preset target gesture includes: when the gesture of the at least one target subject matches one of the preset target sub-gestures, triggering a shooting instruction to obtain a sub-gesture picture corresponding to that preset target sub-gesture.
In one embodiment, the preset target gesture set is composed of a plurality of preset target gestures, and when the gesture of at least one target subject is detected to match the preset target gesture, the shooting instruction is triggered, including: and when the gesture of at least one target subject is detected to be matched with any preset target gesture in the preset target gesture set, triggering a shooting instruction.
In one embodiment, triggering the shooting instruction when the gesture of at least one target subject is detected to match the preset target gesture includes: when it is detected that a plurality of target subjects are contained in the image, detecting the gestures of the plurality of target subjects, and triggering the shooting instruction when the gestures of the plurality of target subjects are detected to simultaneously match the preset target gesture.
In one embodiment, the deep learning neural network model detects the posture of at least one target subject, and when the posture of the at least one target subject is detected to be matched with the preset target posture, the shooting instruction is triggered, including: detecting the posture of at least one target subject by using the deep learning neural network model to obtain the posture of the at least one target subject and a state parameter corresponding to the posture of the at least one target subject, wherein the change of the state parameter reflects the state change of the posture of the target subject corresponding to the state parameter; and when the gesture of at least one target body is matched with the preset target gesture and the state parameter of the gesture of the target body matched with the preset target gesture meets a preset threshold value, triggering a shooting instruction.
In one embodiment, when it is detected that the posture of at least one target subject matches a preset target posture, a shooting instruction is triggered, including: acquiring voice data of at least one target subject; detecting and identifying voice data to obtain a corresponding voice identification result; and when the gesture of at least one target body is matched with the preset target gesture and the voice recognition result is matched with the preset target voice data, triggering a shooting instruction.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of: the method comprises the steps of obtaining an image collected by an image collecting device, identifying at least one target main body in the image, continuously tracking the posture change of the at least one target main body, detecting the posture of the at least one target main body through a trained deep learning neural network model, and triggering a shooting instruction when the posture of the at least one target main body is matched with a preset target posture.
In one embodiment, identifying at least one target subject in the image and continuously tracking a change in pose of the at least one target subject includes: inputting a current image into a trained image recognition model, acquiring historical position information of at least one target subject in a historical image corresponding to the current image by the trained image recognition model, determining a predicted position area of the at least one target subject in the current image according to the historical position information, and outputting the current position information of the at least one target subject in the current image when the at least one target subject is detected in the predicted position area.
In one embodiment, prior to detecting the pose of the at least one target subject by the trained deep learning neural network model, the computer program further causes the processor to perform the steps of: inputting the training image set carrying the attitude label into the deep learning neural network model, acquiring state data corresponding to the attitude label, taking the state data as an expected output result of the deep learning neural network model, training the deep learning neural network model, updating parameters of the deep learning neural network model, and obtaining the trained deep learning neural network model.
In one embodiment, updating parameters of the deep learning neural network model to obtain a trained deep learning neural network model includes: extracting the attitude features of each image in the image set to obtain a corresponding attitude feature set, adjusting the weight of each attitude feature in the attitude feature set corresponding to each image, weighting the corresponding attitude feature according to the weight of each attitude feature to obtain current state data, obtaining the target weight of each corresponding attitude feature when the current state data and the expected state data meet the convergence condition, obtaining the parameters of the deep learning neural network model according to the target weight, and obtaining the trained deep learning neural network model.
In one embodiment, detecting the pose of at least one target subject through a trained deep learning neural network model comprises: inputting an image area containing at least one target subject into a trained deep learning neural network model, extracting attitude features of the image area containing at least one target subject to obtain a target attitude feature set corresponding to at least one target subject, weighting each attitude feature in the attitude feature set of at least one target subject according to the weight of each attitude feature to obtain corresponding target state data, and obtaining the target attitude of at least one target subject according to the corresponding relation between the target state data and the attitude.
In one embodiment, after triggering the shooting instruction, the computer program further causes the processor to perform the steps of: continuously acquiring images acquired by the image acquisition device; entering the step of identifying at least one target subject in the image and continuously tracking the posture change of the at least one target subject; triggering the shooting instruction again when the posture of the at least one target subject is detected to match the preset target posture; and repeatedly entering the step of continuously acquiring images acquired by the image acquisition device, so as to complete continuously triggered shooting.
In one embodiment, the preset target gesture includes a start gesture and an end gesture, and when it is detected that the gesture of the at least one target subject matches the preset target gesture, the shooting instruction is triggered, including: when the gesture of the at least one target subject is detected to be matched with the initial gesture, triggering a shooting instruction, continuously acquiring the picture shot by the image acquisition device, and when the gesture of the at least one target subject is detected to be matched with the termination gesture, stopping shooting the picture by the image acquisition device.
In one embodiment, the preset target gesture includes a plurality of preset target sub-gestures, and when it is detected that the gesture of the at least one target subject matches the preset target gesture, the shooting instruction is triggered, including: and when the gesture of at least one target main body is matched with the preset target sub-gestures, triggering a shooting instruction, and shooting to obtain sub-gesture pictures corresponding to the preset target sub-gestures.
In one embodiment, the preset target gesture set is composed of a plurality of preset target gestures, and when the gesture of at least one target subject is detected to match the preset target gesture, the shooting instruction is triggered, including: and when the gesture of at least one target subject is detected to be matched with any one preset target gesture in the preset target gesture set, triggering a shooting instruction.
In one embodiment, triggering the shooting instruction when the gesture of at least one target subject is detected to match the preset target gesture includes: when it is detected that a plurality of target subjects are contained in the image, detecting the gestures of the plurality of target subjects, and triggering the shooting instruction when the gestures of the plurality of target subjects are detected to simultaneously match the preset target gesture.
In one embodiment, the deep learning neural network model detects the posture of at least one target subject, and when the posture of the at least one target subject is detected to be matched with the preset target posture, the shooting instruction is triggered, including: detecting the posture of at least one target subject by using the deep learning neural network model to obtain the posture of the at least one target subject and a state parameter corresponding to the posture of the at least one target subject, wherein the change of the state parameter reflects the state change of the posture of the target subject corresponding to the state parameter; and triggering a shooting instruction when the gesture of at least one target body is matched with the preset target gesture and the state parameter of the gesture of the target body matched with the preset target gesture meets a preset threshold value.
In one embodiment, when it is detected that the posture of at least one target subject matches a preset target posture, a shooting instruction is triggered, including: acquiring voice data of at least one target subject; detecting and identifying voice data to obtain a corresponding voice identification result; and when the gesture of at least one target body is matched with the preset target gesture and the voice recognition result is matched with the preset target voice data, triggering a shooting instruction.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing related hardware; the program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; nevertheless, as long as a combination of technical features contains no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be noted that those of ordinary skill in the art can make several variations and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (7)

1. An image capture method, the method comprising:
acquiring an image acquired by an image acquisition device;
inputting a current image into a trained image recognition model, wherein the trained image recognition model acquires historical position information of a plurality of target subjects in a historical image corresponding to the current image;
for each target subject, the following processing is performed:
jointly predicting, according to the historical position information and motion information of the target subject, a predicted position area of the target subject in the current image; when the target subject is detected within the range of the predicted position area, outputting the current position information of the target subject in the current image;
segmenting the target main body at the current image, inputting an image area containing the target main body into a trained deep learning neural network model for attitude feature extraction, and obtaining a target attitude feature set corresponding to the target main body; weighting each attitude feature in the attitude feature set of the target subject according to the weight of each attitude feature to obtain target state data; obtaining the posture of the target main body according to the corresponding relation between the target state data and the posture;
acquiring, through a voice acquisition device, voice data uttered by at least one photographed person, wherein the voice data comprises text information matched with the posture of a target subject; detecting and recognizing the voice data to obtain a voice recognition result;
when the postures of the target subjects are detected to simultaneously match the initial posture and the voice recognition result matches preset target voice data, triggering a shooting instruction to continuously obtain pictures shot by the image acquisition device; and when the posture of each target subject matches the termination posture, stopping the image acquisition device from shooting pictures, wherein the pictures continuously acquired from the initial posture to the termination posture form a video.
2. The method of claim 1, wherein before the step of inputting the image region containing the target subject into the trained deep learning neural network model for pose feature extraction, the method further comprises:
inputting a training image set carrying a posture label into a deep learning neural network model;
acquiring state data corresponding to the attitude tag;
taking the state data as an expected output result of the deep learning neural network model, and training the deep learning neural network model;
and updating the parameters of the deep learning neural network model to obtain the trained deep learning neural network model.
3. The method of claim 2, wherein the step of updating the parameters of the deep learning neural network model to obtain the trained deep learning neural network model comprises:
extracting the attitude characteristics of each image in the training image set to obtain a corresponding attitude characteristic set;
adjusting the weight of each attitude feature in the attitude feature set corresponding to each image;
weighting the corresponding attitude characteristics according to the weight of each attitude characteristic to obtain current state data;
when the current state data and the expected state data meet the convergence condition, obtaining the target weight of each corresponding attitude feature;
and obtaining parameters of the deep learning neural network model according to the target weight.
4. The method of claim 1, further comprising:
when the posture of the target subject is obtained by using the deep learning neural network model, state parameters corresponding to the posture of the target subject are also obtained, wherein the change of the state parameters reflects the state change of the posture of the target subject;
when the postures of the target subjects simultaneously match the initial posture and the voice recognition result matches preset target voice data, triggering a shooting instruction, which comprises:
triggering the shooting instruction when the postures of the target subjects simultaneously match the initial posture, the state parameters of the postures of the target subjects satisfy a preset threshold, and the voice recognition result matches the preset target voice data.
5. An image capturing apparatus, characterized in that the apparatus comprises:
the image acquisition module is used for acquiring an image acquired by the image acquisition device;
the target subject identification tracking module is used for inputting a current image into a trained image recognition model, wherein the trained image recognition model acquires historical position information of a plurality of target subjects in a historical image corresponding to the current image; and for performing, for each target subject, the following processing: jointly predicting, according to the historical position information and motion information of the target subject, a predicted position area of the target subject in the current image; and when the target subject is detected within the range of the predicted position area, outputting the current position information of the target subject in the current image;
the posture detection module is used for segmenting each target subject from the current image and inputting an image area containing the target subject into a trained deep learning neural network model for posture feature extraction, obtaining a target posture feature set corresponding to the target subject; weighting each posture feature in the posture feature set of the target subject according to its weight to obtain target state data; and obtaining the posture of the target subject according to the corresponding relation between the target state data and postures;
the shooting module is used for acquiring, through a voice acquisition device, voice data uttered by at least one photographed person, wherein the voice data comprises text information matched with the posture of a target subject; detecting and recognizing the voice data to obtain a voice recognition result; triggering a shooting instruction to continuously obtain pictures shot by the image acquisition device when the postures of the target subjects are detected to simultaneously match the initial posture and the voice recognition result matches preset target voice data; and stopping the image acquisition device from shooting pictures when the posture of each target subject matches the termination posture, wherein the pictures continuously acquired from the initial posture to the termination posture form a video.
6. A computer-readable storage medium, storing a computer program which, when executed by a processor, causes the processor to carry out the steps of the method according to any one of claims 1 to 4.
7. A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1 to 4.
CN201810122474.1A 2018-02-07 2018-02-07 Image shooting method and device, computer equipment and storage medium Active CN108307116B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810122474.1A CN108307116B (en) 2018-02-07 2018-02-07 Image shooting method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810122474.1A CN108307116B (en) 2018-02-07 2018-02-07 Image shooting method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN108307116A CN108307116A (en) 2018-07-20
CN108307116B true CN108307116B (en) 2022-03-29

Family

ID=62864537

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810122474.1A Active CN108307116B (en) 2018-02-07 2018-02-07 Image shooting method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN108307116B (en)

Families Citing this family (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109259726A (en) * 2018-08-19 2019-01-25 天津大学 A kind of automatic grasp shoot method of tongue picture based on bidirectional imaging
CN109241949A (en) * 2018-10-19 2019-01-18 珠海格力电器股份有限公司 Image processing method, air conditioning equipment, terminal, storage medium and electronic device
CN109194879B (en) * 2018-11-19 2021-09-07 Oppo广东移动通信有限公司 Photographing method, photographing device, storage medium and mobile terminal
CN109740593B (en) * 2018-12-18 2020-11-13 全球能源互联网研究院有限公司 Method and device for determining position of at least one predetermined target in sample
CN109658323A (en) * 2018-12-19 2019-04-19 北京旷视科技有限公司 Image acquiring method, device, electronic equipment and computer storage medium
CN109922266B (en) * 2019-03-29 2021-04-06 睿魔智能科技(深圳)有限公司 Snapshot method and system applied to video shooting, camera and storage medium
CN111787215A (en) * 2019-04-03 2020-10-16 阿里巴巴集团控股有限公司 Shooting method and device, electronic equipment and storage medium
CN110110604A (en) 2019-04-10 2019-08-09 东软集团股份有限公司 Target object detection method, device, readable storage medium storing program for executing and electronic equipment
CN110336939A (en) * 2019-05-29 2019-10-15 努比亚技术有限公司 A kind of method for snap control, wearable device and computer readable storage medium
CN110210406A (en) * 2019-06-04 2019-09-06 北京字节跳动网络技术有限公司 Method and apparatus for shooting image
CN110473185B (en) * 2019-08-07 2022-03-15 Oppo广东移动通信有限公司 Image processing method and device, electronic equipment and computer readable storage medium
CN110650291B (en) * 2019-10-23 2021-06-08 Oppo广东移动通信有限公司 Target focus tracking method and device, electronic equipment and computer readable storage medium
CN110677592B (en) * 2019-10-31 2022-06-10 Oppo广东移动通信有限公司 Subject focusing method and device, computer equipment and storage medium
CN113114924A (en) * 2020-01-13 2021-07-13 北京地平线机器人技术研发有限公司 Image shooting method and device, computer readable storage medium and electronic equipment
CN112752016B (en) * 2020-02-14 2023-06-16 腾讯科技(深圳)有限公司 Shooting method, shooting device, computer equipment and storage medium
CN111263073B (en) * 2020-02-27 2021-11-09 维沃移动通信有限公司 Image processing method and electronic device
CN111967404B (en) * 2020-08-20 2024-05-31 苏州凝眸物联科技有限公司 Automatic snapshot method for specific scene
CN114520795B (en) * 2020-11-19 2024-02-09 腾讯科技(深圳)有限公司 Group creation method, group creation device, computer device and storage medium
CN112700344A (en) * 2020-12-22 2021-04-23 成都睿畜电子科技有限公司 Farm management method, farm management device, farm management medium and farm management equipment
CN112843722B (en) * 2020-12-31 2023-05-12 上海米哈游天命科技有限公司 Shooting method, shooting device, shooting equipment and storage medium
CN112843690B (en) * 2020-12-31 2023-05-12 上海米哈游天命科技有限公司 Shooting method, shooting device, shooting equipment and storage medium
CN112843693B (en) * 2020-12-31 2023-12-29 上海米哈游天命科技有限公司 Method and device for shooting image, electronic equipment and storage medium
CN112750437A (en) * 2021-01-04 2021-05-04 欧普照明股份有限公司 Control method, control device and electronic equipment
CN112967289A (en) * 2021-02-08 2021-06-15 上海西井信息科技有限公司 Security check package matching method, system, equipment and storage medium
CN113395452B (en) * 2021-06-24 2023-02-03 上海卓易科技股份有限公司 Automatic shooting method
CN115049732A (en) * 2022-06-30 2022-09-13 北京奇艺世纪科技有限公司 Sample image acquisition method and device, electronic equipment and storage medium
CN115227273A (en) * 2022-08-26 2022-10-25 江西中科九峰智慧医疗科技有限公司 Method, apparatus, device and medium for controlling imaging process of DR equipment


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9946355B2 (en) * 2015-09-01 2018-04-17 Samsung Electronics Co., Ltd. System and method for operating a mobile device using motion gestures

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104767940A (en) * 2015-04-14 2015-07-08 深圳市欧珀通信软件有限公司 Photography method and device
CN107370942A (en) * 2017-06-30 2017-11-21 广东欧珀移动通信有限公司 Photographic method, device, storage medium and terminal
CN107635095A (en) * 2017-09-20 2018-01-26 广东欧珀移动通信有限公司 Shoot method, apparatus, storage medium and the capture apparatus of photo

Also Published As

Publication number Publication date
CN108307116A (en) 2018-07-20

Similar Documents

Publication Publication Date Title
CN108307116B (en) Image shooting method and device, computer equipment and storage medium
US11045705B2 (en) Methods and systems for 3D ball trajectory reconstruction
US11157742B2 (en) Methods and systems for multiplayer tagging for ball game analytics generation with a mobile computing device
US10748376B2 (en) Real-time game tracking with a mobile device using artificial intelligence
CN108234870B (en) Image processing method, device, terminal and storage medium
JP4497236B2 (en) Detection information registration device, electronic device, detection information registration device control method, electronic device control method, detection information registration device control program, electronic device control program
US20200371535A1 (en) Automatic image capturing method and device, unmanned aerial vehicle and storage medium
WO2019071370A1 (en) Feature fusion for multi-modal machine learning analysis
US10796448B2 (en) Methods and systems for player location determination in gameplay with a mobile device
CN112651291B (en) Gesture estimation method and device based on video, medium and electronic equipment
US20150379333A1 (en) Three-Dimensional Motion Analysis System
JP7424375B2 (en) Image processing device, image processing method, and program
JP5963525B2 (en) Recognition device, control method thereof, control program, imaging device and display device
JP2022008187A (en) Training of object recognition neural network
CN114625251A (en) Interaction method and device based on VR, computer equipment and storage medium
US20220273984A1 (en) Method and device for recommending golf-related contents, and non-transitory computer-readable recording medium
JP6876312B1 (en) Learning model generation method, computer program and information processing device
KR102198337B1 (en) Electronic apparatus, controlling method of electronic apparatus, and computer readable medium
US20220277547A1 (en) Method and electronic device for detecting candid moment in image frame
US20230398408A1 (en) Method, device, and non-transitory computer-readable recording medium for estimating information on golf swing
JP2020136898A (en) Imaging apparatus, electronic apparatus and imaging method
KR102542878B1 (en) Method, Server and System for Recognizing Motion in Video
CN115623313A (en) Image processing method, image processing apparatus, electronic device, and storage medium
CN114550298A (en) Short video action identification method and system
CN114821771A (en) Clipping object determining method in image, video clipping method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant